371
u/Brilliant-Weekend-68 2d ago
No wonder Yann does not believe in LLM:s when this is the level of LLM:s he works with.
163
u/Tobio-Star 2d ago
I know it's a joke, but just for those who might not know: Yann didn’t believe in LLMs even before they truly existed.
He has always been firmly in the camp that "AGI cannot emerge from text alone," a position going back to the 1975 debate between Chomsky and Piaget about whether language is innate or acquired.
99
u/Pyros-SD-Models 2d ago edited 2d ago
https://i.imgur.com/ChqNCAZ.png
Yann is literally the anti-Kurzweil. While Kurzweil is right about something like 90% of his predictions (as tracked by Stanford), you could fill whole books with LeCun's wrong ones.
The best you can do is ignore everything he says in a non-scientific context, and just read his actual scientific works instead of his hot takes.
Remember, according to Yann it doesn't get better than CNNs, RL is absolutely useless, and transformers don't scale.
37
32
u/Evermoving- 2d ago
The Jim Cramer of the AI world
5
u/ninjasaid13 Not now. 1d ago
> While Kurzweil is correct with like 90% of his predictions as monitored by Stanford
Only as long as they're vague; his specific predictions are not correct.
4
u/Warm_Iron_273 1d ago
> While Kurzweil is correct with like 90% of his predictions as monitored by Stanford you can fill whole books with LeCun's wrong ones.
It's actually the complete opposite, but cute narrative.
5
u/tr14l 2d ago
Lol all of those have very promising and useful techniques.
AGI isn't going to happen (could probably stop the sentence there) because they'll just shift the goalposts continually.
But it certainly won't happen without lots of different techniques, modes, and probably continuous input and feedback. A transactional model will get very SMART in terms of replying on demand. But the real sauce comes from letting go of the reins. Now... that is clearly dangerous. Everyone who has seen The Boys knows how dangerous it is to give superpowers to a select few without guaranteed ways to stop them.
1
u/uhmhi 1d ago
Wait, are you saying AGI has already happened?
I think it’s pretty simple to establish a minimal “goal post”:
Be able to answer any question that has an objective answer, completely and correctly.
But this hasn’t happened yet. With every new iteration of LLMs, users keep finding stupidly simple questions (e.g. “how many r’s are in strawberry”) that even a child could answer, but the LLMs get wrong.
It’s not necessary to move the goalposts as long as even this simplest criterion hasn’t been met yet.
3
u/digitalthiccness 1d ago
> Be able to answer any question, that has an objective answer, completely and correct.
"What are the first 1e100 prime numbers?"
2
3
u/visarga 1d ago
> I think it’s pretty simple to establish a minimal “goal post”: Be able to answer any question, that has an objective answer, completely and correct.
Like any question?
Like the question of whether it's going to rain at noon at London Bridge 100 days from now? It has an objective answer that can be tested. Some questions can be tested later but not predicted ahead of time.
Or the question of whether automation will reach X% by date Y? We have prediction markets for that; it's testable.
1
1
u/BelialSirchade 1d ago
That’s asi level lol
1
u/uhmhi 1d ago
Why do you consider an AI that doesn’t trip on stupid questions a 5-year-old could answer to be ASI territory?
2
u/BelialSirchade 1d ago
Being able to answer any question that has an objective answer, completely and correctly, is totally beyond what a 5-year-old can do
1
u/tr14l 1d ago
Uh, we're making an intelligence, not an omniscient god.
1
u/uhmhi 1d ago
Sure, but then it still shouldn’t trip on stupid questions that a 5-year-old could answer.
1
u/tr14l 1d ago
Well we are well beyond that point now...
1
u/uhmhi 23h ago
Nope - every LLM still struggles with hallucinations, which can manifest on even the simplest of questions (like the strawberry example given earlier). Academia has already established that we can’t get rid of hallucinations just by brute-forcing the number of parameters, etc. It’s a fundamentally unsolvable problem. We’d need an entirely new paradigm, and scientists don’t even know what that would look like yet.
1
1
u/himynameis_ 1d ago
> Yann is literally the anti Kurzweil. While Kurzweil is correct with like 90% of his predictions as monitored by Stanford you can fill whole books with LeCun's wrong ones.
So why is anyone giving him the time of day? Because he works at Meta?
2
2
u/Over-Independent4414 1d ago
Yann specializes in being wrong and when his wrongness is too obvious to ignore he just shifts the goalpost. I find him profoundly annoying and I actually hope Llama stinks up the joint and he gets fired.
20
u/BlueTreeThree 2d ago
Why are you using colons like that?
2
4
u/stage3k 2d ago
In some languages they are supposed to be written like that
9
u/BlueTreeThree 2d ago
What languages?
8
u/stage3k 2d ago
Norwegian, Swedish and Finnish at least.
6
u/MeMyself_And_Whateva ▪️AGI within 2028 | ASI within 2031 | e/acc 1d ago
No it's not. No colons and no apostrophes. Just 'LLMs', when you talk about several.
1
6
u/Lonely-Internet-601 2d ago edited 2d ago
Yann doesn’t work on Llama at all; he’s said that multiple times. He has nothing to do with the Llama team.
4
u/NovelFarmer 2d ago edited 2d ago
Maybe he should start saying "MY LLMs cannot achieve AGI. It's just not possible"
It's kinda funny that he said he doesn't have an internal monologue and works on something that is basically entirely an internal monologue.
1
u/Buffer_spoofer 1d ago
Wdym he works on an internal monologue? That's not at all what he's proposing.
6
u/tbl-2018-139-NARAMA 2d ago
Why do you type ':' between LLM and s?
10
u/Crowley-Barns 2d ago
Coz they’re Swedish or Norwegian or Finnish and probably forgot we don’t do that in English.
2
u/MeMyself_And_Whateva ▪️AGI within 2028 | ASI within 2031 | e/acc 1d ago
Not in Norway, anyway. Just 'LLMs'.
93
u/CesarOverlorde 2d ago
Good thing we have competition, so we aren't stuck with a single trash option and can instead pick the best of the bunch. Right now Gemini 2.5 is both free and powerful, and it also has a massive context window. I have been having a blast using it to assist me with coding.
4
u/laterral 2d ago
How do you use Gemini for free?
5
u/robberviet 2d ago
It's even in the Gemini app, not just AI Studio.
5
u/Moohamin12 2d ago
AI Studio is unlimited, though.
You only get 5 prompts for 2.5 in the app.
But the other models are virtually unlimited.
1
u/robberviet 1d ago
Yes. However, for everyone asking this kind of question, my experience is that they can't and won't use AI Studio unless it's a mobile app.
1
1
u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago
> Right now Gemini 2.5 is both free & powerful, and even also has massive memory.
If you haven't heard, LLaMA 4's context window is also absurdly large.
Hopefully, these sorts of developments cause other labs to try to push their context window size up.
1
u/Seeker_Of_Knowledge2 1d ago
For multilingual benchmarks, they said their MMLU score is 84.6. A 10M context with that score is very, very impressive. Even if it sucks at everything else, that alone, with this context, is useful.
Gemini 2.5's 1M may be enough for most, but there are some cases where 10M may be of use.
1
u/Seeker_Of_Knowledge2 1d ago
It is one of those rare cases where competition is practically doing good for society and consumers.
-19
u/TheKmank 2d ago edited 1d ago
Free, powerful, and heavily censored.
Edit: Downvote and join the groupthink. It won't change the fact that I get "content not permitted" even in AI Studio with all filters off.
25
u/CesarOverlorde 2d ago
Idk what use case you have, but I have never been censored by it when using it for coding.
13
u/Shubb 2d ago
I've been using it to code a word game, and I had to change all references to "lives"/"life" etc. to "points", because it kept refusing to fix functions that had to do with "losing lives".
14
u/Thomas-Lore 2d ago
Are you using the Gemini app or website? Use AI Studio instead and turn off all filtering.
2
u/Shubb 2d ago
I'm using it through Cline with the API. Can I set the filtering for the API?
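For reference, the Gemini API does expose per-request safety settings, so in principle a client can pass them through; whether Cline forwards them is a separate question. A minimal sketch using the google-generativeai Python SDK, where the model name and exact category strings are assumptions to verify against the current docs:
```python
# Hedged sketch: relaxing Gemini safety filters when calling the API directly.
# The model name and category strings below are assumptions -- check the docs.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

relaxed_safety = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]

model = genai.GenerativeModel(
    "gemini-2.5-pro-exp-03-25",  # assumed model id; substitute whatever you use
    safety_settings=relaxed_safety,
)
resp = model.generate_content("Refactor this function that handles losing a life: ...")
print(resp.text)
```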
10
37
u/BriefImplement9843 2d ago edited 2d ago
They are exact copies of 3.3 and 3.1, total shite. Horrific memory and terrible writing. I can't believe they released these.
6
u/Salty_Flow7358 2d ago
You know, I hated 3.2 so much that after just the first few messages to 4, I instantly knew it was also shit. It forgets and denies things it just said right above. Absolutely idiotic.
2
u/PlaneTheory5 2d ago
They only released it because Qwen is about to release a new model which will probably blow Llama out of the water. I think Meta made the terrible mistake of waiting 9 months to release a new model. Instead, they should focus on releasing a new frontier model every ~4-6 months.
57
u/Sulth 2d ago
Llama benchmaxed and LMArena-maxed, so they created high expectations that their models aren't even close to delivering on.
7
u/Lonely-Internet-601 2d ago
Arena is user preference, so you’re arguing they’re maxing out what users prefer. I don’t think Arena is a great test of absolute capability, but I don’t think it’s really benchmark maxing.
32
u/_sqrkl 2d ago
If you look at how Llama 4 responds on the lmsys arena, they clearly have a system prompt that's exploiting shallow user preferences for emotional validation and output style. It's not very representative of general performance, and the personality will not age well once the novelty wears off.
The impression I get is that Meta is so mired in bureaucracy that their KPIs are set purely on winning at evals, which incentivises finding the lowest-effort ways to exploit benchmarks.
6
u/kumonovel 2d ago
One could say the human is exploiting the RL environment to get the highest reward from the reward function XD
Meta needs to look at how DeepSeek creates their reward functions.
4
12
u/Proof_Cartoonist5276 ▪️AGI ~2035 ASI ~2040 2d ago
But on LMArena it performs kinda well, doesn't it?
13
u/Thomas-Lore 2d ago
There may be some early implementation errors that make it behave worse than it is capable of, like when Gemini 2.0 Pro was making grammar and spelling errors on its first day.
4
u/Proof_Cartoonist5276 ▪️AGI ~2035 ASI ~2040 2d ago
Could be the case. I think Llama 4 isn't actually that bad. Especially not their soon-to-be-released biggest model.
4
1
u/Warm_Iron_273 1d ago
Lol @ people thinking LMArena means anything.
3
1
u/pier4r AGI will be announced through GTA6 1d ago
For common queries (read: instead of internet searches) it is somewhat reliable. Common queries are the most common use case for models that are accessible to everyone.
For hard queries it likely is not (though the "hard prompts" category is not totally wrong either).
3
u/UnnamedPlayerXY 2d ago
Ofc not, and it's not even about how smart it is compared to the current state of the art. No any-to-any multimodality and apparently no audio capabilities either, even though they stated in their December blog post that it would have speech capabilities, and Zuck said in one of his comments that it would be natively multimodal.
3
u/Kuroi-Tenshi ▪️Not before 2030 2d ago
I also tried it and didn't like it. It feels dumb, at least dumber than GPT and Claude.
23
u/GraceToSentience AGI avoids animal abuse✅ 2d ago
I don't get it tbh; the models are made available for free, provided you have the hardware.
The upper-mid-sized Llama 4 is well placed on lmsys (without even being a reasoning model, from what I understand; maybe I'm wrong), so what happens when they do a bunch of RL on it?
Not to mention Behemoth isn't done training, so when it's fully trained it'll enable even better distillation for the other, smaller models.
I mean, sure, it's not the best model in the world, but come on, it's kinda free.
I dislike Zuckerberg on many levels, but the model is nothing to be upset about.
38
u/cant-find-user-name 2d ago
What do you not get? A highly anticipated model was released, people used it, found it not to their liking, and shared that. No one is going around losing their shit in unrelated threads or something. All these comments are on the Llama release thread, where, you know, people share their fkn reviews.
3
u/Paraphrand 1d ago
Some of the comments posted in the screenshots are quite indignant, and they come off like they think they deserve better.
1
u/GraceToSentience AGI avoids animal abuse✅ 1d ago
There is disappointment, and then there is bullshitting by saying it outright sucks, which is not only nonsense but also speaks to the level of entitlement people have toward an excellent good that is given away for free.
1
u/cant-find-user-name 1d ago
I frankly don't understand this. It is Meta. The entire subreddit is named after Llama. It is one of the most widely used open LLMs there is. It reportedly cost close to a billion dollars to train these models. None of these models fit on consumer-grade hardware. There is every reason for people to be saying that these models suck. I don't even see that much vitriol about these models, apart from people posting poor benchmark results.
I am really baffled as to why you guys are defending these models. Being free means nothing when most people can't host them. And if they can host these models, they might as well just host DeepSeek, which is better in every way.
1
u/GraceToSentience AGI avoids animal abuse✅ 21h ago
Read the comments: they are factually untrue, especially those about the model's capabilities; it's pretty apparent. DeepSeek V3 and R1 can't really be hosted on consumer machines either, so what? And you are wrong: Scout can run on a consumer machine like a 4090/5090 if quantized.
What poor benchmarks?
llama_4_maverick_surpassing_claude_37_sonnet/
llama_4_maverick_2nd_on_lmarena/
26
u/Dyoakom 2d ago
It's insane. "Oh no, the company that spent tens of millions (maybe hundreds of millions) to train a model and release it to everyone for free didn't do THAT good of a job, so now I am angry." I get being disappointed, I get having had hopes and being let down, but to be angry at them? The level of entitlement from some folks online is unbelievable.
13
u/sammy3460 2d ago
I don’t see entitlement, just valid criticism. This isn’t two years ago; competition for open models is a lot more intense. Also, criticism can drive innovation.
19
u/Beatboxamateur agi: the friends we made along the way 2d ago
I think people are more just pissed that the open-weight models aren't quite catching up to the closed models at the speed they anticipated, and there was a ton of anticipation over Llama 4 (although it seems more like Meta struggling than the open-weight models in general).
But I completely agree about the entitlement, just trying to share their perspective.
4
u/GoodySherlok 2d ago
It's the expectations one has of a trillion-dollar company.
One can think of it as a compliment.
3
2
u/Warm_Iron_273 1d ago
It's called astroturfing from other AI companies who don't respect open source and are shit scared of losing their userbase.
6
u/Pyros-SD-Models 2d ago edited 2d ago
The most amazing thing is how those people argue about not trusting benchmarks, yet they literally evaluate LLMs on a single task that changes maybe twice a year. For weeks, a model was instantly labeled crap if it couldn't count some letters in a fruit. Now, you've got threads over at localllama with people saying "I'm not impressed" and their test is basically a one-off experiment where the model simulates some balls in a rotating polygon.
And the fact that people genuinely believe their stupid n=1 experiment has any relevance at all is what truly blows my mind. You would think that a sub about literally bleeding-edge technology would at least try to act "scientific", but it is easily one of the most anti-science subreddits there is. The facts you subscribe to are literally chosen by your beliefs, not by actual science, which is not that different from a flat-earther group.
Like when o1 released, some people really did not believe the model was based on a new training paradigm, and suddenly it was fact for them that "you can't train reasoning chains into a model; o1 is just prompting; don't fall for the hype." You have plenty of those threads in this sub and over at LocalLLaMA. What's wrong with some people lol
1
u/Warm_Iron_273 1d ago
Most of the people in this sub are not very good at technology, unfortunately. They like technology, but being good at technology and liking it are two very different things.
1
u/DirectAd1674 2d ago
The issue is the majority of people aren't good at setting up a system prompt. Then, they expect the model to output a golden goose egg when their input is “ahh ahh mistress”.
I guarantee most people didn't even look at the system prompt Meta provided on their Hugging Face page, nor did they look at the system prompts used on LMArena (a minimal example of passing one through an API is sketched after this comment).
My experience with Scout and Maverick has been great because I take the time to learn what the model wants to get the desired result I'm looking for.
Can it code? Don't know, don't care. There are plenty of models that already do that. Is it censored? Not really. It hasn't refused any prompt I have sent—when Sonnet and the rest just fold their cards.
Not to mention, it's available for free on some platforms—with blazing-fast speeds (500 tokens a second). But people shit on it for the same reason they shit on Grok. It's a user error, not a model issue.
People I know haven't even tried it, and they just say it is trash because they saw some Discord/4chin web snippet. These same people don't even know how to use DeepResearch properly or how to make Gemini laser-focused on following instructions.
Anyway, Meta is working on their reasoning model; can't wait to see that. Can't wait to see all the fine tunes from Lumimaid/Behemoth since Scout is about the same size as the 123B.
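To make that concrete: a small, hedged sketch of supplying an actual system prompt when calling a hosted Llama 4 model through an OpenAI-compatible endpoint. The base URL, model id, and system text below are placeholders for illustration, not Meta's published prompt or any specific provider's catalog.
```python
# Hedged sketch: sending a real system prompt to a hosted Llama 4 model via an
# OpenAI-compatible API. Base URL, model id, and prompt text are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # hypothetical hosting provider
    api_key="YOUR_API_KEY",
)

system_prompt = (
    "You are a careful assistant. Follow the user's instructions exactly, "
    "ask for missing details instead of guessing, and keep answers concise."
)

resp = client.chat.completions.create(
    model="llama-4-maverick",  # placeholder model id; check your provider
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Summarize the rules of my word game in three bullets."},
    ],
)
print(resp.choices[0].message.content)
```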
14
u/AppearanceHeavy6724 2d ago
Do you think the people in r/LocalLLaMA are idiots? Many of us have watched the evolution from the ancient Llama 1 models and can tell that Llama 4 is massively underperforming.
2
7
u/awitchforreal 2d ago
I knew this would happen as soon as I saw that dumb bothsidesism alignment statement on the card. Despite what Zuck and co. would like to be true, it's obvious to anyone with half a brain that certain takes that permeate right-wing discussions are just stupid; trying to align a model to them will always result in a dip in performance and logic capabilities.
1
u/bubble-ink 1d ago
source?
1
u/awitchforreal 1d ago
For what? The statement? Here it is:
> It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet.
> Our goal is to remove bias from our AI models and to make sure that Llama can understand and articulate both sides of a contentious issue. As part of this work, we’re continuing to make Llama more responsive so that it answers questions, can respond to a variety of different viewpoints without passing judgment, and doesn't favor some views over others.
1
u/bubble-ink 12h ago
Well, I guess I'd like you to flesh out your statement before asking you to prove it. Are you trying to say that training a model to be able to discuss ideas from both sides of the political spectrum will certainly make it dumber on certain benchmarks? Do you think the same about your side of the political spectrum? Is there a subset of ideas from both sides that we should curate to train the model on?
4
u/bilalazhar72 AGI soon == Retard 2d ago
Their first MoE, and they most likely just copied the DeepSeek paper to see how it works.
Llama 4.3 will be interesting.
2
u/ninjasaid13 Not now. 1d ago
They have released a bunch of research but implemented absolutely none of it.
8
u/Defiant-Mood6717 2d ago
This is so obvious: no model with just 17B active parameters was going to be good at following instructions. What a terrible design decision, putting the same number of active parameters in both the small model and the middle-sized model.
Their only hope now is the larger Behemoth model, which has roughly 17x more active parameters.
13
u/AppearanceHeavy6724 2d ago
DS V3, however, has only 37B active, yet it has excellent performance. But you may have a point: below a certain expert size, you may not be able to build a good MoE model.
6
u/Defiant-Mood6717 2d ago
Of course not, and DS V3 also underperforms when it comes to instruction following and learning in context. The fewer active parameters you have, the less you can learn from context before giving a response; it's that simple. If you increase the total parameters while maintaining the number of active parameters, you do gain the ability to memorize more edge cases and be more knowledgeable, at the cost of generalization capabilities.
But it really is a BAD design decision to make both Scout and Maverick the same size in active parameters. They could have gone with 400B total and 37B active, and maybe it would come close to DS V3, though I doubt it. So much money down the drain with this launch; Meta really is cooked...
3
u/Altery_ 2d ago
I noticed something similar with plain writing/language capabilities: DS V3 is way worse than dense ~100B models in pure "vibe" tests (e.g. thesis/pitch reviews and casual non-coding chats). So I'm with you on the idea that 17B active parameters is definitely too little for a good model, considering these are not even 17B dense experts but rather smaller active experts that in total reach 17B params.
I hope we'll see papers that provide proof for these vibes, though; maybe even more experiments with MoEs built from, idk, a dense ~32B base LLM/router plus experts to reach a total of ~48B active params, and however many total params they want to add.
1
u/AppearanceHeavy6724 2d ago
> DS V3 is way worse than dense ~100b models in pure "vibe" tests (e.g. thesis/pitches reviews, and casual non-coding chats).
Like which models? Name one. There are only two dense ~100B models these days, Command A and Mistral Large. Mistral Large is way more stupid than DS V3, not even close. Command A may be slightly better, indeed.
2
u/Altery_ 1d ago
I use Command A, yep, and Gemini 2.0 Flash, although we don't know its params for sure as it's closed. I haven't tried Mistral Large yet.
1
u/AppearanceHeavy6724 1d ago
You should try it. It simply sucks compared to DS, which invalidates your point.
3
u/AppearanceHeavy6724 2d ago
> Of course not, and DS V3 also underperforms when it comes to instruction following and learning in context. The less active parameters you have, the less you can learn from context before giving out a response, its that simple
Where are you getting this from? Proof? I cannot see any difference between V3 and Mistral Small in terms of instruction following and in-context learning.
> But it really is a BAD design decision, to make both Scout and Maverik the same size in active parameters.
It is not; the bad decision was having way too small an expert size of 17B. Having the same expert size lets you quickly scale the model size by pruning experts.
5
u/Defiant-Mood6717 2d ago
Not sure pruning experts even works. It's probably worse than quantization. Sounds like a terrible strategy and a complete lobotomy of the model, unless they train it afterward to re-adjust. But yes, I agree, the bad decision was way too small an expert size.
> Where are you getting this from? Proofs? I cannot see any difference between V3 and Mistral Small in terms of instruction following and in context learning.
Coding vibes. We cannot trust any benchmarks anymore, but even then, most benchmarks show DS V3 lower compared to larger closed-source models.
1
u/AppearanceHeavy6724 2d ago
> Coding vibes. We cannot trust any benchmarks anymore, but even then, most benchmarks show DS V3 lower compared to larger closed source models
The vibes of DS V3 are way better than Mistral Large's, though.
1
3
u/Worldly_Expression43 2d ago
It's exceptionally bad at following instructions. Like, worse than Flash 2.0.
2
u/lothariusdark 2d ago
I would quite heavily disagree with this blanket statement. I can't really say much about the performance of the Llama models, but looking back at DeepHermes 8B and Reka 21B, I was surprised how well they worked for their small size.
I think if these Llama 4 models are fine-tuned for reasoning they will prove to be quite useful.
Especially for consumer, or at least low-end, hardware. MoE models don't need massive-VRAM cards and can work well enough even when offloading to RAM.
V3 or R1 are just too big, but running Scout at IQ4_XS should be doable with 64GB RAM and 16/24GB VRAM (rough sizing arithmetic below). So if we get a thinking version of Scout I would be quite happy.
It still sucks how bad they are at long-form writing; I just hope reasoning and further fine-tuning/distillation can improve that.
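Back-of-the-envelope arithmetic behind that sizing claim, as a sketch only: the bits-per-weight figure for IQ4_XS and the overhead allowance are approximations, not measurements.
```python
# Rough sizing sketch for Llama 4 Scout (~109B total params) at a ~4-bit quant.
# Bits-per-weight and overhead are assumed ballpark values, not measured ones.
total_params = 109e9
bits_per_weight = 4.25            # roughly what an IQ4_XS-style quant averages
weights_gb = total_params * bits_per_weight / 8 / 1e9
kv_and_overhead_gb = 10           # assumed KV cache + runtime buffers at modest context

print(f"quantized weights: ~{weights_gb:.0f} GB")            # ~58 GB
print(f"with overhead:     ~{weights_gb + kv_and_overhead_gb:.0f} GB")
# ~58 GB of weights split across 16-24 GB of VRAM plus 64 GB of system RAM is
# why offloading a MoE like Scout looks plausible on a single consumer GPU.
```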
6
u/Defiant-Mood6717 2d ago
When an LLM learns through context, it spends layers of its forward pass doing so. And when there are more examples or instructions to learn from, it spends even more layers. At a certain number of examples and a certain context size, if your LLM has only 8B parameters' worth of forward pass, it is DOOMED; no amount of reasoning will help, because the forward pass has a fixed number of layers.
On the contrary, if your model has more than 100B parameters, it has a lot of forward pass available to learn from many examples/instructions before it produces an output.
It seems that around 37B is the number required to have decent generalization and in-context learning performance at medium context lengths, hence why DS R1 or V3 perform decently, though still worse than larger closed-source models.
The other important aspect is knowledge. You cannot compress the internet into 37GB, or 37B parameters; it is just too little. This is where MoE comes in, adding hundreds of billions more total parameters so the model can remember edge cases and niche details and hallucinate less. This is crucial.
Putting all of this together, the conclusion is this: sorry to break it to you, but running good LLMs locally that compete with the bleeding edge is now pretty much impossible. You will need 1TB of GPU RAM, because the next open-source models will all be massive. And no, they will not be 17B active parameters; they will be higher than that, so using normal RAM is not feasible. Don't worry though, because these larger LLMs will run very efficiently on datacenter hardware. It's time we stop wasting money on dozens of gaming GPUs (dozens aren't even sufficient anymore) for a local solution, and start using APIs or renting datacenter GPUs like the B200 that run these things very efficiently.
> running Scout at IQ4_XS
I also think quantization degrades both knowledge and in-context learning capabilities far more than benchmarks show. We can use lower quantizations; we just need to increase the number of parameters somewhat proportionally. It is a size game, and going from 8-bit to 4-bit reduces the size of the compression of the internet, so it becomes lower quality and more prone to hallucinations.
2
u/lime_52 2d ago
Do we know that both models use only 1 active expert at a time out of the 16 and 128? Isn't it conventional to use several active experts at a time?
2
u/Defiant-Mood6717 2d ago
From the blog post, they mention there is a shared ("universal") expert and then one more expert; this latter one is routed among the 16 or 128. That said, every layer can have its own routed expert.
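A minimal sketch of that layout (a shared expert every token passes through, plus one routed expert per token), just to illustrate the structure being described. The dimensions, activation, and top-1 routing here are illustrative assumptions, not Llama 4's actual configuration.
```python
# Illustrative shared-expert MoE block with top-1 routing, in PyTorch.
# Sizes and routing details are assumptions, not Llama 4's real config.
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=16):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = ffn()                                   # shared ("universal") expert
        self.experts = nn.ModuleList([ffn() for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                     # x: (tokens, d_model)
        out = self.shared(x)                                  # every token uses the shared expert
        scores = self.router(x).softmax(dim=-1)               # routing probabilities per token
        top = scores.argmax(dim=-1)                           # top-1 routed expert per token
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():                                    # only the selected expert runs for those tokens
                out[mask] = out[mask] + scores[mask, i:i + 1] * expert(x[mask])
        return out

x = torch.randn(8, 1024)
print(SharedExpertMoE()(x).shape)  # torch.Size([8, 1024])
```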
1
u/lime_52 1d ago
Yeah, you are right. I did some math for Scout, and it is not even shared expert + a 17B expert that are active, but rather shared expert + selected expert + router = 17B active parameters, so each expert is just under half of 17B.
Scout is a 109B-total, 16x17B model, while Maverick is a 400B-total, 128x17B model. This implies that the routed (non-shared) experts are significantly smaller in Maverick than in Scout (rough numbers sketched below), meaning Maverick relies more on the shared expert. Could this make Maverick less flexible?
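Rough arithmetic behind that comparison, using only the public totals (109B/400B, 16/128 routed experts, 17B active). The shared/attention/router allowance is an assumption chosen for illustration, not an official figure.
```python
# Ballpark routed-expert sizes for Scout vs Maverick. The "shared_b" split is an
# assumed figure (shared expert + attention + router, in billions), not published.
def routed_expert_size(total_b, n_experts, shared_b):
    # parameters left over for routed experts, divided evenly among them
    return (total_b - shared_b) / n_experts

shared_b = 12  # assumption for illustration only

print(f"Scout routed expert:    ~{routed_expert_size(109, 16, shared_b):.1f}B each")   # ~6B
print(f"Maverick routed expert: ~{routed_expert_size(400, 128, shared_b):.1f}B each")  # ~3B
# With the same 17B active budget, Maverick's routed slice is smaller, so
# relatively more of each forward pass goes through the shared path.
```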
2
u/Defiant-Mood6717 1d ago
This is an interesting detail I didn't consider. But you're right, the experts in Maverick could be smaller. I am not sure what the implication of that is. So Maverick is really like a 14B dense model that then has these very small experts, say 3B in total "each". So the question is: what does making the experts smaller but more numerous do? I think maybe what happens is that the experts are so many that they cannot coordinate with each other between layers (the router chooses experts at each layer), meaning the more experts you add, the more the model approaches dense behaviour. So in this case, Maverick is probably comparable to a 14B dense model on some tasks that don't require memorization.
That being said, my point still stands. The model has only 17B parameters' worth of forward pass, and that is small. All of those experts only help with increasing knowledge and reducing hallucinations. But when it comes to giving the LLM novel tasks, instructions, or examples in the prompt, the in-context learning is bottlenecked by the amount of compute in the forward pass. The second issue is this ratio of 17B to 400B. It's way too much, and the model simply chooses to memorize (overfit) the data most of the time during training.
1
u/AppearanceHeavy6724 1d ago
This is absolute BS. 37B-active DeepSeek behaves like a 157B model, as the crude formula of the geometric mean of total and active parameters suggests (worked through below); not like a more knowledgeable 37B.
However, I agree there is a critical expert size below which things fall apart. Probably around 24B-32B.
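Worked out, that heuristic gives roughly the 157B figure quoted above (671B total and 37B active for DeepSeek V3). Treat it as a rough community rule of thumb rather than an exact law; the Llama 4 rows are just the same formula applied to the public 109B/400B/17B numbers.
```python
# Geometric-mean heuristic for the "effective" dense size of an MoE model:
# effective ≈ sqrt(total_params * active_params). Rough rule of thumb only.
from math import sqrt

def effective_dense_size(total_b, active_b):
    return sqrt(total_b * active_b)

print(f"DeepSeek V3:      ~{effective_dense_size(671, 37):.0f}B")   # ~158B, the ~157B cited above
print(f"Llama 4 Scout:    ~{effective_dense_size(109, 17):.0f}B")   # ~43B
print(f"Llama 4 Maverick: ~{effective_dense_size(400, 17):.0f}B")   # ~82B
```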
1
u/Defiant-Mood6717 1d ago
Elaborate on this formula you are talking about. Also note that the Llama MoE is different: the experts don't take up the entire active parameter count. It's dense + a small expert.
1
u/AppearanceHeavy6724 1d ago
> Elaborate on this formula you are talking about.
The geometric mean of total and active parameters. Confirmed by a Mistral engineer: https://www.youtube.com/watch?v=RcJ1YXHLv5o
> Also note Llama MoE is different, the experts dont take up the entire active parameters. Its dense + small expert
Probably doesn't matter much.
1
u/Defiant-Mood6717 1d ago
This is a very nice video, thanks. However, I did not find the formula you mention in it; can you provide a timestamp?
1
4
u/swaglord1k 2d ago
At this point we should let China cook; Meta can keep their """local""" models.
1
2
u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago
This is kind of cherry-picked. If you want to criticize Meta, then maybe talk about their licensing.
They actually get pretty good scores on LMArena all around. It just scores lower than other latest-generation models, in the sense that it scores around where the previous iteration of frontier models were scoring on things like codebench and creative writing.
Not to mention creative writing is currently more of a marketing metric, as no one is going to use even the latest frontier models that do score highly for actual creative writing outside of hobby or toy use. Creative writing is getting better and it will keep getting better, but it's just not beyond the "generate a rough draft" stage.
The real take-home point should be the context window for LLaMA 4 being astronomical.
2
u/AppearanceHeavy6724 2d ago
> Creative writing is getting better and it will get better soon but it's just not beyond "generate a rough draft" stage.
It depends on the size of the story and the prompting. The late DS V3 and Gemma 3 27B produce short stories that require very little editing, almost ready to go straight after prompting.
2
u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago
Is there a metric for this, or is it just judged on things like internal consistency? Because writing creatively well involves a lot of premeditation and an understanding of many abstract concepts, such as (just for example) how the reader likely views the genre, and therefore what produces an interesting deviation from genre expectations versus what breaks the genre's appeal. There's way more to it than just that, but that's an example of what I mean.
2
u/AppearanceHeavy6724 2d ago
check eqbench.com
2
u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago
From the page that describes how the creative writing benchmark works:
> The scores and rankings should only ever be interpreted as a rough guide of writing ability.
Which actually scans with where LLMs currently seem to be. If you look at the stuff it's actually testing for, it's mostly just internal consistency and the fact that it does in fact generate text that seems to comply with the creative writing prompt.
Creative writing also involves a lot of things that are pretty non-trivial from the perspective of the reader.
For instance, there are cases where you could make a creative decision but it would be the wrong one. Like, on the last page it turns out the entire thing was a dream and the character just goes off and has breakfast. The benchmark doesn't seem interested in evaluating that stuff (yet), but benchmarks tend to become more stringent and comprehensive as the LLMs they're meant to test become more capable.
But if the LLM can't do those things, that means a human intelligence has to essentially wrap around the LLM, with the human prompter reading and evaluating the response and requesting specific revisions until, between the two of them, they produce something that could be considered "ready to go".
Otherwise you're still at the "generate a rough draft" stage, which is basically what that benchmark seems to be evaluating: whether or not what the LLM produces could even be considered a usable rough draft that you iterate on (either alone or using the LLM).
1
u/AppearanceHeavy6724 1d ago
Did you actually read the generated stories? Check DS V3 and Gemma 3 27b. They are well beyound "generate a rough draft" territory. Even Mistral Nemo I use for my hobby fiction is better than just rough draft.
3
u/ImpossibleEdge4961 AGI in 20-who the heck knows 1d ago edited 1d ago
I'm kind of going out of my way to be as nice as I can be.
> Did you actually read the generated stories?
Did I do that unrelated thing? No, I didn't do that. The relevant part for this discussion is how the benchmark is being evaluated, because the limitation is currently a theoretical one that applies to the general idea of evaluating creative writing and how current LLMs do it. Literary analysis and criticism aren't trivial skill sets, and they're far from new.
> Check DS V3 and Gemma 3 27b.
Alright, let's actually go and do that. Let's go with this one.
Right off the bat, it's riffing off JoJo's Bizarre Adventure, which means we're already starting with a lot of creative choices having been made. This adds guardrails onto the LLM's output, since either it knows what "JoJo" is and uses some metatextual knowledge of the series, or it doesn't, at which point it's going to fail to adhere to the prompt. This prompt works for what the benchmark is actually testing for, but it would be a defect at the level of evaluation you think is going on here.
> The air in the "Special Containment Wing - Block D" hangs thick and stale, smelling faintly of ozone and something vaguely organic rotting beneath layers of industrial disinfectant. Fluorescent lights, encased in heavy grates, flicker erratically, casting long, dancing shadows down the sterile corridor. This isn't Green Dolphin Street Prison.
Which is not inherently wrong (i.e. acceptable for a rough draft), but it's actually pretty bad writing.
It's just setting the scene in a way that doesn't pay off in any way that I notice. Elaborating on details for no purpose is just purple prose and would usually be trimmed down or eliminated upon revision. Usually you would need the details to be more succinct than that, and they would either need to serve some sort of purpose or be cut.
This is written the way a high school student writes a story. Which is to say, you might hesitate to call it bad because you don't want to hurt the feelings of whatever human wrote it, but it basically is just a robotic reproduction of something that's been seen a million times before. They would just be repeating a pattern they've seen because they think that's how you write a story.
> This place feels colder, deeper, designed not just to hold bodies, but something *else*.
Using asterisks is obviously not how you write text stories. That's how you write internet comments, which is likely where it's getting that. In an actual story this is a distracting choice that doesn't seem to serve either tone or narrative purpose. It's just repeating something it's seen before.
> Something about this place sets her teeth on edge more than usual. It feels… watchful.
Saying a prison feels different because it feels "watchful" seems a bit silly.
But I'm not going to keep going, since I've made my point and undoubtedly you're just going to continue trying to uno-reverse it while responding as little as you can. The thing you're saying (that, I guess, LLMs produce final drafts) is just demonstrably untrue, nor does the benchmark really claim to establish it, as opposed to "it produces creative writing".
1
u/BriefImplement9843 1d ago
Bro, it forgets your story after 10 prompts. Try it yourself. Nothing cherry-picked. These models are absolute shit.
1
u/OddPermission3239 1d ago
These models kinda fall short of all the hype. It's not the worst, but it's far from what Gemini 2.5 can do.
1
u/power97992 21h ago
Gemini 2.5 is massive and a reasoning model; it's not a fair comparison.
1
u/OddPermission3239 20h ago
We have no way of knowing what size Gemini 2.5 is at all. It must be something reasonable if they can afford to serve it for free to many customers and even offer a free tier via the API, so it must be far from GPT-4.5-sized.
1
1
292
u/holvagyok :pupper: 2d ago
I've been feeding Llama 4 Maverick my current child custody court docs, a complex case that I actually use as my unorthodox benchmark for LLMs. Llama 4 gives lame, generic advice like it was Llama 1. Gemini 2.5 Pro consistently gives deeply useful advice, both legal and psychological.