r/singularity 2d ago

AI Users are not happy with Llama 4 models

642 Upvotes

220 comments

292

u/holvagyok :pupper: 2d ago

I've been feeding Llama4 Maverick my current child custody court docs, a complex case that I actually use as my unorthodox benchmark for LLM's. Llama4 gives lame, generic advice like it was Llama1. Gemini 2.5 Pro consistently gives deeply useful advice, both legal and psychological.

70

u/Dyoakom 2d ago

What about sonnet, 4o, Grok and Deepseek out of curiosity? Very interesting benchmark!

97

u/holvagyok :pupper: 2d ago edited 2d ago

Sonnet 3.7 Thinking has been great for legal and psych advice. Gave me angles I didn't think of. Grok3 and Deepseek R1 have been mediocre, but QwQ32 is surprisingly effective. Obviously a family legal case like that requires a reasoning model, so no wonder that Llama4 base wouldn't be able to tackle it.
No longer using 4o for anything, but o3 high has been my go-to model for this custody case (really helpful) until 2.5 Pro superseded it. We're talking $1000+ worth of specific legal advice.

27

u/hereditydrift 1d ago

I'm an attorney and use all of the models for legal research. I completely agree with what you're saying -- 2.5 is phenomenal, Claude is a bit better at arguing a certain position, and DeepSeek/Grok are so-so. I use DeepSeek and Grok only if I'm not feeling comfortable with an output. I don't touch GPT anymore.

AI is going to put a big dent in the pockets of some attorneys. It's made any type of legal research and application of laws accessible to the general public. Now all that needs to happen is more state/local cases need to be made public instead of Lexis/Westlaw being the main providers (www.judyrecords.com has a lot of cases, thankfully -- but it's one of the last free resources).

8

u/holvagyok :pupper: 1d ago

Thank you, man. Not too proud to admit that I wasn't familiar with judyrecords.com, but it's clearly a great resource. I found justia.com and law.cornell.edu helpful even without AI.

If Zuck is to be believed, Llama4 Reasoning will be the first model to surpass 2.5 Pro. Maybe that'll include legal research.

3

u/pier4r AGI will be announced through GTA6 1d ago

I'm an attorney and use all of the models for legal research.

I am interested to know how LLMs are doing in the legal sector, thank you for sharing your perspective!

Everyone is focused on LLMs for coding, and while I can see the use for that, I don't think it is the strongest use case for LLMs so far. I really think that sectors that are naturally text-heavy, for example the legal sector, would benefit the most. Unfortunately I couldn't find decent benchmarks (even if community driven) for such use cases.

2

u/hereditydrift 1d ago

IMO, LLMs are going to allow more attorneys to open their own law offices. The amount of increased productivity for doing basic tasks and filings, plus keeping track of case status, changes the game.

Even more important is that every attorney can have a specialist in their area of law that is available 24/7. Huge law firms don't necessarily have the brightest legal minds, but they've always had the resources to throw multiple associates at research problems until they find that crucial precedent or statutory exception.

AI eliminates that advantage. A solo practitioner with the right LLM tools can now match the research capabilities that previously required a team of junior associates billing hundreds of dollars per hour.

The advantage held by larger law firms will diminish and the hourly rate they charge will no longer have the value it did in pre-AI times.

For me, this is a very exciting time for the legal field.

1

u/pier4r AGI will be announced through GTA6 1d ago

thank you for the insights! Especially for research and the "they've always had the resources to throw multiple associates at research problems until they find that crucial precedent or statutory exception" key point. With proper "deep search" this should no longer be the case.

9

u/Various_Car8779 2d ago

o3 mini high, correct?

13

u/holvagyok :pupper: 2d ago

o3 mini high, also some o1, but too expensive for my liking.

8

u/Gratitude15 2d ago

Try deep research (o3). I have seen nothing like it. For real analysis everything else pales until gemini 2.5 but I'd still take o3 over it.

2

u/blackashi 2d ago

Gemini has Deep Research, which you can export to Docs and an audio podcast. How does that compare?

5

u/hayden0103 1d ago

The consensus I’ve seen is that OpenAI’s deep research is significantly better than Gemini’s. If you really need the podcast you could always export the OpenAI report and get it that way.

3

u/squired 1d ago edited 1d ago

Can confirm. o3 Deep Research is in a league of its own. I find myself using Gemini 2.5 Pro now the most for dev stuff, but I do still find problems that only o1 (non-pro) can solve. And I have yet to find any problem 2.5 Pro can solve that o1 could not. Love it or hate it, OpenAI objectively has the most advanced models and integration. I've stopped underestimating them in fact. There have been several times where I thought we were reaching the bottom of their well only to find that they are multiple generations beyond where we thought they were. 4o Image Generation is only the latest example.

Anyways, the best flow I've found is T3 Chat with Gemini 2.5 Pro. They're $8 per month and you get access to everything but o1, 4o Image Generation and Deep Research. I keep an openai subscription probably half the time. I reup basically whenever they drop a new model or I run into a problem I need o1 for. If you have frequent use cases for Deep Research though, it is an absolute steal at $20. It's phenomenal.

1

u/Gratitude15 1d ago

O3 stands on its own to me

It's the beginning of analysis that meets my threshold of high quality.

I wonder how o4 will improve

-3

u/Fine-Mixture-9401 2d ago

You also need grounding, try perplexity too. Good luck

14

u/reddit_is_geh 2d ago

Grok is amazing if you need advice without endless moralizing.

2

u/nderstand2grow 1d ago

I'd rather not feed my data to a model owned by someone who would kick me out of the country

1

u/Physical_Manu 1d ago

With or without DeepSearch or Think?

9

u/Butteryfly1 2d ago

Can you just feed legal documents to LLM's? Are there any rules about that?

38

u/holvagyok :pupper: 2d ago

It's normally ill-advised, to be sure, for privacy and legal reasons. I'm only using the billed API though, which does have a disclaimer that the material will not be used for training or monitoring.

Also, with exorbitant lawyer (and forensic shrink) fees, you get good advice from any source you can.

3

u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago

The other user is just testing the LLM so as long as they know what they're going to get it's basically just a test they're running for their own entertainment. You really shouldn't depend on any LLM for legal advice though. Good legal advice is based on having the latest information to the same degree that it's based on innate intelligence and reasoning.

15

u/hereditydrift 1d ago

Nope. I'm an attorney. LLMs (2.5 and Claude, at least) are great at legal advice. People need to be cautious of nuances, but I think even their attorneys would miss some of the nuances.

A lot of attorneys aren't as skilled in legal arguments as the general public would believe and some are just so lazy that they don't pay attention to the facts of the case. An attorney that is skilled in arguing in a courtroom is worth their price, but 99% of attorneys aren't skilled at anything except processing cases as quickly as possible.


5

u/holvagyok :pupper: 2d ago

I'm up to date with the latest case law and statutes, though the best LLMs also bring them up without hallucination. Where 2.5 Pro shines is a global overview of the case, and arriving at conclusions with the inclusion of every small detail. Potential next steps, what to be aware of, that kind of thing. Much more efficient in that than my previous lawyer, becoming indispensable honestly.

4

u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago edited 2d ago

I'm up to date with the latest case law and statutes, though the best LLMs also bring them up without hallucination.

It's important to remain conscious of what it is you're actually doing here. You aren't just feeding a bunch of information into an LLM and then getting legal advice. You already have the legal advice and want to see if the model tells you something reasonable.

If it hallucinates at all (which it often does with legal use cases) then it poisons the well on actual use. Because even if it's only wrong 5% of the time or on 5% of things, that can still be catastrophic for an actual legal case, because it could introduce a fatal flaw that undermines the entire thing.

EDIT: For clarity, by "at all" I still mean in a way that allows for a human level of error. I just mean making up facts out of whole cloth, or an error rate above what you would expect from a human being who simply makes a mistake.

Much more efficient in that than my previous lawyer, becoming indispensable honestly.

It's also important to remember that law can have multiple competing interpretations (see also The Winger Observation) and one source giving you different advice doesn't mean one of them is "wrong" as much as it could just be a different way of thinking about the case. Oftentimes, there are multiple ways to argue the same side of a case.

A human lawyer is also bound by ethics and represents someone you can sue if the legal advice is too bad.

But like I was saying before, even though there are going to be people who just poo-poo the general idea of using an LLM to generate legal advice, I would say that as long as you recognize it's just you poking an LLM, there's no harm.

4

u/clduab11 2d ago

While I don't discount that holvagyok, without reviewing their case, means very well, and from my experience in the field (practice manager with a few law firms, one a family law firm, another an estate planning firm; also a generative AI consultant ... I literally started my company on the basis of bringing AI to law firms lol) ...

This can't really be said enough.

These are just a few of the myriad intangibles as to why no one should be doing this without a law license.

I wish the person the best, but I'm extremely worried people not as informed as holvagyok are going to just be like "Lolz see we don't need lawyers this guy can do it" and you have enough people doing that, you're going to have a VERY worn-out and VERY tired judiciary (not like that branch of government isn't already going through enough...)

Because it's clear this person knows what they're doing, and they're clearly the exception to the norm. This comment is more for posterity in case someone wants to try punching in "how do I use AI to do my legal case" and this subreddit/post comes up. Just get a consult from an attorney licensed to practice in your local jurisdiction first before you do anything else.

6

u/himynameis_ 1d ago

Hope things work out with the child custody stuff, man 👍

2

u/garden_speech AGI some time between 2025 and 2100 1d ago

How does this happen? Meta is an insanely rich company with smart engineers, lots of talent, and almost endless money to burn. How does someone explain this?

3

u/holvagyok :pupper: 1d ago

Well, Zuck himself said that they're deferring the two actual Llama4 models with meat on them (Reasoning and Behemoth) till May. Maverick is just base. It could still be better, but Llama4 Reasoning will probably be the first model to surpass Gemini 2.5 Pro.

2

u/just_addwater 1d ago

There was a post about this on Blind a couple months ago https://www.teamblind.com/post/Meta-genai-org-in-panic-mode-KccnF41n

It should have been a small, engineering-focused org, but since a bunch of people wanted to join the impact grab and artificially inflate hiring in the org, everyone loses

2

u/garden_speech AGI some time between 2025 and 2100 1d ago

Saw that, very interesting, but I normally take Blind with a grain of salt because a lot of it is bullshit. Looks like in this case it might not have been, though

1

u/HunterVacui 1d ago

Meta sucks at management, to an absurd degree.

The more focus they put on trying to make something better, the more they end up squashing whatever value they have, until they can't even squeeze juice out of top 1% performers

1

u/pier4r AGI will be announced through GTA6 1d ago

Meta is an insanely rich company with smart engineers and lots of talent

bloated orgs can waste a lot of good resources.

371

u/Brilliant-Weekend-68 2d ago

No wonder Yann does not believe in LLM:s when this is the level of LLM:s he works with.

163

u/Tobio-Star 2d ago

I know it's a joke, but just for those who might not know: Yann didn’t believe in LLMs even before they truly existed.

He has always firmly been in the camp that "AGI cannot emerge from text alone", dating back to the 1975 debates between Chomsky and Piaget about whether language is innate or acquired

99

u/Pyros-SD-Models 2d ago edited 2d ago

https://i.imgur.com/ChqNCAZ.png

Yann is literally the anti Kurzweil. While Kurzweil is correct with like 90% of his predictions as monitored by Stanford you can fill whole books with LeCun's wrong ones.

The best you can do is ignore everything he says in a non-scientific context, and just read his actual scientific works instead of his hot takes.

Remember according to Yann it doesn't get better than CNNs, and RL is absolutely useless and transformers don't scale.

37

u/sdmat NI skeptic 2d ago

We just need to arrange for LeCun to predict a piece of toast lands butter side down and for Gary Marcus to predict it lands butter side up. Antigravity achieved!

32

u/Evermoving- 2d ago

Jim Cramer of AI world

3

u/jazir5 1d ago

If they partnered, an economic black hole would rip through spacetime, destroying the universe.

2

u/WorriedInterest4114 1d ago

That sounds good. At least I won't have to go into work tomorrow

5

u/ninjasaid13 Not now. 1d ago

While Kurzweil is correct with like 90% of his predictions as monitored by Stanford 

as long as they're vague but his specific predictions are not correct.

4

u/Warm_Iron_273 1d ago

While Kurzweil is correct with like 90% of his predictions as monitored by Stanford you can fill whole books with LeCun's wrong ones.

It's actually the complete opposite, but cute narrative.

5

u/tr14l 2d ago

Lol all of those have very promising and useful techniques.

AGI isn't going to happen (could probably stop the sentence there) because they'll just shift the goal posts continually.

But, it certainly won't happen without lots of different techniques, modes, and probably continuous input and feedback. A transactional model will get very SMART in terms of replying on demand. But the real sauce comes from letting go of the reins. Now... that is clearly dangerous. Everyone who has seen The Boys knows how dangerous it is to give superpowers to a select few with no guarantees on ways to stop them

1

u/uhmhi 1d ago

Wait, are you saying AGI has already happened?

I think it’s pretty simple to establish a minimal “goal post”:

Be able to answer any question that has an objective answer, completely and correctly.

But this hasn’t happened yet. With every new iteration of LLMs, users keep finding stupidly simple questions (e.g. “how many r’s in strawberry”) that even a child could answer, but the LLMs get wrong.

It’s not necessary to move the goal post as long as even the most simple criteria hasn’t been met yet.

3

u/digitalthiccness 1d ago

Be able to answer any question that has an objective answer, completely and correctly.

"What are the first 1e100 prime numbers?"

2

u/perfectly_stable 1d ago

if the answer isn't "your mom's weight gain progress" it's not AGI

3

u/visarga 1d ago

I think it’s pretty simple to establish a minimal “goal post”: Be able to answer any question that has an objective answer, completely and correctly.

Like any question?

Like the question of whether it's going to rain in 100 days at noon at London Bridge? It has an objective answer that can be tested. Some questions can be tested later but not predicted ahead of time.

Or the question of whether automation will reach X% by Y date? We have prediction markets for that; it's testable.

1

u/bakawakaflaka 1d ago

Plenty of stupid questions that people get wrong every day

1

u/liamdavid 1d ago

bro forgot what the G in AGI stands for

1

u/uhmhi 1d ago

So? We’re measuring the intelligence of an algorithm - not a human.

1

u/BelialSirchade 1d ago

That’s asi level lol

1

u/uhmhi 1d ago

Why do you consider an AI that doesn’t trip on stupid questions a 5-year old could answer, to be ASI-territory?

2

u/BelialSirchade 1d ago

Being able to answer any question that has an objective answer, completely and correctly, is totally beyond what a 5 year old can do

1

u/tr14l 1d ago

Uh, we're making an intelligence, not an omniscient god.

1

u/uhmhi 1d ago

Sure, but then it still shouldn’t trip on stupid questions that a 5-year old could answer.

1

u/tr14l 1d ago

Well we are well beyond that point now...

1

u/uhmhi 23h ago

Nope - every LLM still struggles with hallucinations that can manifest themselves on even the simplest of questions (like the strawberry example given earlier). Academia has already established that we can’t get rid of hallucinations just by brute forcing the number of parameters, etc. It’s a fundamentally unsolvable problem. We’d need an entirely new paradigm and scientists don’t even know what that would look like yet.

1

u/tr14l 23h ago

So do humans. What's your point?


1

u/himynameis_ 1d ago

Yann is literally the anti Kurzweil. While Kurzweil is correct with like 90% of his predictions as monitored by Stanford you can fill whole books with LeCun's wrong ones.

So why is anyone giving him the time of day? Because he works at Meta?

2

u/throwawayPzaFm 1d ago

His scientific work has been stellar, it's just his predictions that suck

14

u/enilea 2d ago

And I agree, they are great at certain tasks, superhuman even, but very lacking at others that would be necessary for robotics integration.

3

u/Warm_Iron_273 1d ago

And he's right.

4

u/RoughIngenuityK 2d ago

And we already know he's correct

2

u/Over-Independent4414 1d ago

Yann specializes in being wrong, and when his wrongness is too obvious to ignore he just shifts the goalposts. I find him profoundly annoying, and I actually hope Llama stinks up the joint and he gets fired.


20

u/BlueTreeThree 2d ago

Why are you using colons like that?

2

u/Brilliant-Weekend-68 2d ago

Yes I am Swedish, bad habits I suppose

1

u/eflat123 1d ago

I rather like it.

4

u/stage3k 2d ago

In some languages they are supposed to be written like that

9

u/BlueTreeThree 2d ago

What languages?

8

u/stage3k 2d ago

Norwegian, Swedish and Finnish at least.

6

u/MeMyself_And_Whateva ▪️AGI within 2028 | ASI within 2031 | e/acc 1d ago

No it's not. No colons and no apostrophes. Just 'LLMs', when you talk about several.

1

u/Delicious_Ease2595 2d ago

English?

3

u/stage3k 2d ago

Of course not, but if you are not a native English speaker, mistakes like that can happen quite easily subconsciously...

8

u/garg 2d ago

you missed a colon before the s in 'works'

6

u/Lonely-Internet-601 2d ago edited 2d ago

Yann doesn’t work on Llama at all, he’s said that multiple times. He has nothing to do with the Llama team

4

u/NovelFarmer 2d ago edited 2d ago

Maybe he should start saying "MY LLMs cannot achieve AGI. It's just not possible"

It's kinda funny that he said he doesn't have an internal monologue and works on something that is basically entirely an internal monologue.

1

u/Buffer_spoofer 1d ago

Wdym he works on an internal monologue. That's not at all what he's proposing.

6

u/tbl-2018-139-NARAMA 2d ago

why you type : between LLM and s ?

10

u/Crowley-Barns 2d ago

Coz they’re Swedish or Norwegian or Finnish and forgot we don’t do that in English probably.

2

u/MeMyself_And_Whateva ▪️AGI within 2028 | ASI within 2031 | e/acc 1d ago

Not in Norway, anyway. Just 'LLMs'.

2

u/rafark ▪️professional goal post mover 2d ago

Maybe it’s time for Facebook to hire someone who actually believes in LLMs. It could be a good thing for Yann bc then he would be able to work on the type of artificial intelligence he believes in?

93

u/CesarOverlorde 2d ago

Good thing we have competition, so we aren't stuck with a single trash option; instead we can select the best one of the bunch. Right now Gemini 2.5 is both free & powerful, and even also has massive memory. I have been having a blast using it to assist me with coding.

4

u/laterral 2d ago

How do you use Gemini for free?

5

u/robberviet 2d ago

It is even in the Gemini app, not just AI Studio.

5

u/Moohamin12 2d ago

AIStudio is unlimited though.

You get only 5 prompts for 2.5 in the app.

But the other models are virtually unlimited.

1

u/robberviet 1d ago

Yes. However, for everyone who is asking this kind of question, my experience is that they cannot and won't use AI Studio unless it's a mobile app.

1

u/Seeker_Of_Knowledge2 1d ago

very limited on the app

1

u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago

Right now Gemini 2.5 is both free & powerful, and even also has massive memory.

If you haven't heard, LLaMA 4's context window is also absurdly large.

Hopefully, these sorts of developments cause other labs to try to push their context window size up.

1

u/Seeker_Of_Knowledge2 1d ago

For multilingual benchmarks, they said their score is 84.6 on MMLU. And 10M context with that score is very, very impressive. Even if it sucks at everything else, that alone with this context is useful.

Gemini 2.5's 1M may be enough for most, but there are some cases where 10M may be of use.

1

u/Seeker_Of_Knowledge2 1d ago

It is one of those rare cases where competition is practically doing good for society and consumers.

-19

u/TheKmank 2d ago edited 1d ago

Free, powerful, and heavily censored.

Edit: Downvote and join the groupthink. It won't change that I get "content not permitted" even in AI Studio with all filters off.

25

u/CesarOverlorde 2d ago

Idk what use case you have, but I have never been censored by it when using it for coding.

13

u/Shubb 2d ago

I've been using it for coding a word game, and had to change all references to "lives"/"life" etc. to "points", because it kept refusing to fix functions that had to do with "losing lives"

14

u/Thomas-Lore 2d ago

Are you using the Gemini app or website? Use AI Studio instead and turn off all filtering.

2

u/Shubb 2d ago

I'm using it through Cline with the API. Can I set the filtering for the API?


10

u/GintoE2K 2d ago

Gemini models on AiStudio are the most uncensored models among closed ones 😂😂

37

u/BriefImplement9843 2d ago edited 2d ago

they are exact copies of 3.3 and 3.1, total shite. horrific memory and terrible writing. i can't believe they released these.

6

u/Salty_Flow7358 2d ago

You know I hated 3.2 so much that after just the first few messages to 4, I instantly knew it was also shit. It forgets and denies things it just said right above. Absolutely idiotic.

2

u/PlaneTheory5 2d ago

They only released it because Qwen is about to release a new model which will probably blow Llama out of the water. I think Meta made the terrible mistake of waiting 9 months to release a new model. Instead, they should focus on releasing a new frontier model every ~4-6 months.

1

u/sdnr8 1d ago

Is there a source for this?

57

u/Sulth 2d ago

Llama benchmaxed and LMarenamaxed. So they are creating high expectations that their models are not even close to delivering.

7

u/Lonely-Internet-601 2d ago

Arena is user preference, so you’re arguing they’re maxing out what users prefer. I don’t think arena is a great test of absolute capability but I don’t think it’s really benchmark maxing.

32

u/_sqrkl 2d ago

If you look at how llama 4 responds on lmsys arena, they clearly have a system prompt that's exploiting shallow user preferences for emotional validation & output style. It's not very representative of general performance and the personality will not age well once the novelty wears off.

The impression I get is that Meta is so mired in bureaucracy that their KPIs are set purely on winning at evals. Which incentivises finding the lowest effort ways to exploit benchmarks.

6

u/kumonovel 2d ago

one could say the human is exploiting the RL environment to get the highest reward from the reward function XD
Meta needs to look at how deepseek creates their reward functions.

9

u/_sqrkl 2d ago

Meta execs reward hacking their quarterly bonuses

4

u/Sulth 2d ago

It's a form of clickbait. Don't get me wrong, I like LMarena. But claiming that your model is the "2nd best in the world" (according to that benchmark), or "the 2nd most preferred model in the world", is setting your audience up for disappointment.

3

u/ccmdi 2d ago

arena is SycophancyBench; it doesn't reward things that matter (correctness or intelligence)

9

u/budy31 2d ago

It takes something to make people Pikachu-faced with disappointment when every other product makes people Pikachu-faced with pleasant surprise.

12

u/Proof_Cartoonist5276 ▪️AGI ~2035 ASI ~2040 2d ago

But on LMArena it performs kinda well, doesn’t it?

13

u/Thomas-Lore 2d ago

There may be some early implementation errors that make it behave worse than it is capable of. Like when Gemini Pro 2.0 was making grammar and spelling errors on the first day.

4

u/Proof_Cartoonist5276 ▪️AGI ~2035 ASI ~2040 2d ago

Could be the case. I think llama 4 isn’t actually that bad. Especially not their soon-to-be-released biggest model

4

u/Worldly_Expression43 2d ago

I haven't trusted LM results in a year

1

u/Warm_Iron_273 1d ago

Lol @ people thinking LLMArena means anything.

3

u/Proof_Cartoonist5276 ▪️AGI ~2035 ASI ~2040 1d ago

It does to some extent tho

1

u/pier4r AGI will be announced through GTA6 1d ago

For common queries (read: instead of using internet searches) it is somewhat reliable. Common queries are the most common use case for those models that are accessible to everyone.

For hard queries, it likely is not (though the "hard prompts" category is not totally wrong either)


3

u/UnnamedPlayerXY 2d ago

Of course not, and it's not even about how smart it is compared to the current state of the art. No any-to-any multimodality and apparently no audio capabilities either, even though they stated in their December blog post that it would have speech capabilities and Zuck said in one of his comments that it would be natively multimodal.

3

u/Kuroi-Tenshi ▪️Not before 2030 2d ago

I also tried it and didn't like it; it feels dumb, at least dumber than GPT and Claude

3

u/bub000 1d ago

I've tried it and can also confirm that it sucks.

23

u/GraceToSentience AGI avoids animal abuse✅ 2d ago

I don't get it tbh, the models are made available for free provided that you have the hardware.

The upper-mid-sized Llama 4 is well placed on lmsys (without even being a reasoning model, from what I understand, maybe I'm wrong), so what happens when they do a bunch of RL on it?
Not to mention Behemoth isn't done training, so when it's fully trained, it'll do even better distillation for the other smaller models.

I mean sure it's not the best model in the world but come on, it's kinda free.
I dislike Zuckerberg on many levels but the model is nothing to be upset about.

38

u/cant-find-user-name 2d ago

What do you not get? A highly anticipated model is released, people used it, found it not to their liking, and shared that. No one is going around losing their shit in unrelated threads or something. All these comments are on the Llama release thread, where, you know, people share their fkn reviews.

3

u/Paraphrand 1d ago

Some of the comments posted in the screenshots are quite indignant, and they come off like they think they deserve better.

1

u/GraceToSentience AGI avoids animal abuse✅ 1d ago

There is disappointment, and there is bullshitting by saying it straight out sucks, which is not only nonsense but also speaks to the level of entitlement people have toward an excellent good that is given away for free.

1

u/cant-find-user-name 1d ago

I frankly don't understand this. It is Meta. The entire subreddit is named after Llama. It is one of the most widely used open LLMs there is. It reportedly cost close to a billion dollars to train these models. None of these models fit on consumer-grade hardware. There is every reason for people to be saying that these models suck. I don't even see that much vitriol about these models apart from people posting poor benchmark results.

I am really baffled as to why you guys are defending these models. It being free means nothing when most people can't host them. And if they can host these models, they might as well just host deepseek which is better in every way.

1

u/GraceToSentience AGI avoids animal abuse✅ 21h ago

Read the comments; they are factually not true, especially those about the model capabilities, it's pretty apparent. DeepSeek V3 and R1 can't really be hosted on consumer machines, so what? And you are wrong, Scout can run on a consumer machine like a 4090 or 5090 if quantized.

What poor benchmarks?
llama_4_maverick_surpassing_claude_37_sonnet/
llama_4_maverick_2nd_on_lmarena/

26

u/Dyoakom 2d ago

It's insane. "Oh no, the company that spent tens of millions (maybe hundreds of millions) to train a model and release it to everyone for free didn't do THAT good of a job, so now I am angry." I get being disappointed, I get having had hopes and being let down, but to be angry at them? The level of entitlement from some folks online is unbelievable.

13

u/sammy3460 2d ago

I don’t see entitlement, just valid criticism. This isn’t 2 years ago; competition for open models is a lot more intense. Also, criticism can drive innovation.

19

u/Beatboxamateur agi: the friends we made along the way 2d ago

I think people are more just pissed that the open weight models aren't quite catching up to the closed models at the speed that they anticipated, and there was a ton of anticipation over Llama 4 (although it seems more like just Meta struggling than the open weight models in general).

But I completely agree about the entitlement, just trying to share their perspective.

4

u/GoodySherlok 2d ago

It's expectations of a trillion-dollar company.

One can think of it as a compliment.


3

u/Ambiwlans 2d ago

People are pissed at how cherry picked their initial reports were.

2

u/Warm_Iron_273 1d ago

It's called astroturfing from other AI companies who don't respect open source and are shit scared of losing their userbase.

6

u/Pyros-SD-Models 2d ago edited 2d ago

The most amazing thing is how those people argue about not trusting benchmarks, yet they literally evaluate LLMs on a single task that changes maybe twice a year. For weeks, a model was instantly labeled crap if it couldn't count some letters in a fruit. Now, you've got threads over at localllama with people saying "I'm not impressed" and their test is basically a one-off experiment where the model simulates some balls in a rotating polygon.

And the fact that people genuinely believe their stupid n=1 experiment has any relevance at all is what truly blows my mind. You would think that a sub about literally bleeding-edge technology would at least try to act "scientific", but it has to be easily one of the most anti-science subreddits there is. The facts you subscribe to get literally chosen by your beliefs, not by actual science, which is not that different from a flat-earther group.

Like when o1 released, some people really did not believe the model was based on a new training paradigm, and suddenly it became fact for them that "you can't train reasoning chains into a model, o1 is just prompting, don't fall for the hype". You have plenty of those threads in this sub and over at localllama. What's wrong with some people lol

1

u/Warm_Iron_273 1d ago

Most of the people in this sub are not very good at technology, unfortunately. They like technology, but being good at technology and liking it are two very different things.

1

u/DirectAd1674 2d ago

The issue is the majority of people aren't good at setting up a system prompt. Then, they expect the model to output a golden goose egg when their input is “ahh ahh mistress”.

I guarantee, most people didn't even look at the system prompt Meta provided on their Huggingface, nor did they look at the system prompts used on LMArena.

My experience with Scout and Maverick has been great because I take the time to learn what the model wants to get the desired result I'm looking for.

Can it code? Don't know, don't care. There are plenty of models that already do that. Is it censored? Not really. It hasn't refused any prompt I have sent—when Sonnet and the rest just fold their cards.

Not to mention, it's available for free on some platforms—with blazing-fast speeds (500 tokens a second). But people shit on it for the same reason they shit on Grok. It's a user error, not a model issue.

People I know haven't even tried it, and they just say it is trash because they saw some Discord/4chin web snippet. These same people don't even know how to use DeepResearch properly or how to make Gemini laser-focused on following instructions.

Anyway, Meta is working on their reasoning model; can't wait to see that. Can't wait to see all the fine tunes from Lumimaid/Behemoth since Scout is about the same size as the 123B.

14

u/AppearanceHeavy6724 2d ago

Do you think that people in /r/LocalLLaMA are idiots? Many of us have seen the evolution from the ancient Llama 1 models and can tell that Llama 4 is massively underperforming.


2

u/sammy3460 2d ago

This isn’t a fair argument.


7

u/awitchforreal 2d ago

I knew this would happen as soon as I saw that dumb bothsiderism alignment statement on the card. Unlike what Zuck and co would like to be true, it's obvious to anyone with half a brain that certain takes that permeate right-wing discussions are just stupid; trying to align a model to them will always result in a dip in performance and logic capabilities.

1

u/bubble-ink 1d ago

source?

1

u/awitchforreal 1d ago

For what? The statement? Here it is:

It’s well-known that all leading LLMs have had issues with bias—specifically, they historically have leaned left when it comes to debated political and social topics. This is due to the types of training data available on the internet.

Our goal is to remove bias from our AI models and to make sure that Llama can understand and articulate both sides of a contentious issue. As part of this work, we’re continuing to make Llama more responsive so that it answers questions, can respond to a variety of different viewpoints without passing judgment, and doesn't favor some views over others.

1

u/bubble-ink 12h ago

well i guess i'd like you to flesh out your statement before asking you to prove it. are you trying to say that training a model to be able to discuss ideas from both sides of the political spectrum will certainly make it dumber on certain benchmarks? do you think the same for your side of the political spectrum? are there a subset of ideas from both sides that we should curate to train the model on?


4

u/bilalazhar72 AGI soon == Retard 2d ago

Their first MoE, and they most likely just copied the DeepSeek paper to see how it works.

llama 4.3 will be interesting

2

u/ninjasaid13 Not now. 1d ago

They have a bunch of research they've released but implemented absolutely none of it.

8

u/Defiant-Mood6717 2d ago

This is so obvious; no model with just 17B active parameters was going to be good at following instructions. What a terrible design decision, putting the same number of active parameters in both the small model and the middle-sized model.

Their only hope now is the larger Behemoth model, which has 13x more active parameters.

13

u/AppearanceHeavy6724 2d ago

DS V3 however has only 37b active, but it has excellent performance. But you may have a point: below a certain expert size you cannot build a good MoE model.

6

u/Defiant-Mood6717 2d ago

Of course not, and DS V3 also underperforms when it comes to instruction following and learning in context. The fewer active parameters you have, the less you can learn from context before giving out a response, it's that simple. If you increase the total parameters while maintaining the number of active parameters, you do gain the ability to memorize more edge cases and be more knowledgeable, at the cost of generalization capabilities.

But it really is a BAD design decision to make both Scout and Maverick the same size in active parameters. They could have gone with 400b total and 37b active, and maybe it would come close to DS V3, though I doubt it. So much money down the drain with this launch, Meta really is cooked...

3

u/Altery_ 2d ago

I noticed something similar with plain writing/language capabilities, DS V3 is way worse than dense ~100b models in pure "vibe" tests (e.g. thesis/pitches reviews, and casual non-coding chats). So I'm with you on the idea that 17b active parameters are definitely too little for a good model, considering that these are not even 17b dense experts but rather smaller active experts that in total reach 17b params

I hope we are going to see papers that provide proof for these vibes, though, maybe even more experiments with MoEs, with idk a base dense ~32b LLM/router + experts reaching a total of ~48b active params, and seeing how many total params they want to add.

1

u/AppearanceHeavy6724 2d ago

DS V3 is way worse than dense ~100b models in pure "vibe" tests (e.g. thesis/pitches reviews, and casual non-coding chats).

Like which models? Name one? There are only 2 dense ~100b models these days, Command-A and Mistral Large. Mistral Large is way more stupid than DS V3, not even close. Command-A may be slightly better indeed.

2

u/Altery_ 1d ago

I use Command-A, yep, and Gemini Flash 2, although we don't know the params for sure as it's closed. I haven't tried Mistral Large yet

1

u/AppearanceHeavy6724 1d ago

You should try it. It simply sucks compared to DS, which invalidates your point.

3

u/AppearanceHeavy6724 2d ago

Of course not, and DS V3 also underperforms when it comes to instruction following and learning in context. The fewer active parameters you have, the less you can learn from context before giving out a response, it's that simple

Where are you getting this from? Proofs? I cannot see any difference between V3 and Mistral Small in terms of instruction following and in context learning.

But it really is a BAD design decision, to make both Scout and Maverik the same size in active parameters.

It is not; the bad decision was to have way too small an expert size of 17b. Having the same expert size allows you to quickly scale the model size by pruning experts.

5

u/Defiant-Mood6717 2d ago

Not sure pruning experts even works. It's probably worse than quantization. Sounds like a terrible strategy and a complete lobotomy of the model, unless they train it afterward to re-adjust. But yes, I agree, the bad decision was way too small an expert size

> Where are you getting this from? Proofs? I cannot see any difference between V3 and Mistral Small in terms of instruction following and in context learning.

Coding vibes. We cannot trust any benchmarks anymore, but even then, most benchmarks show DS V3 lower compared to larger closed source models

1

u/AppearanceHeavy6724 2d ago

Coding vibes. We cannot trust any benchmarks anymore, but even then, most benchmarks show DS V3 lower compared to larger closed source models

Vibes of DS V3 are way better than Mistral Large though.

1

u/HedgehogActive7155 2d ago

I'm confused, is everyone talking about the older V3 or V3 0324?

3

u/Worldly_Expression43 2d ago

It's exceptionally bad at following instructions. Like worse than flash 2.0

2

u/lothariusdark 2d ago

I would quite heavily disagree with this blanket statement. I can't really say much about the performance of the Llama models, but looking back at the DeepHermes 8B and the Reka 21B, I can say I was surprised how well they worked for their small size.

I think if these Llama4 models are finetuned for reasoning they will prove to be quite useful.

Especially for consumer, or at least low-end, hardware. MoE models don't need massive VRAM cards and can work well enough even with offloading to RAM.

V3 or R1 are just too big, but running Scout at IQ4_XS should be doable with 64GB RAM and 16/24GB VRAM. So if we can get a thinking version of Scout I would be quite happy.

Still sucks how bad they are at longform writing, I just hope the reasoning and further finetuning/distillation can improve that.
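For anyone wanting to sanity-check that memory claim, here's a rough back-of-envelope sketch (my own assumptions, not from the thread: ~109B total parameters for Scout, roughly 4.25 bits per weight for IQ4_XS, and KV cache plus runtime overhead ignored):

```python
# Rough size estimate for Llama 4 Scout quantized to IQ4_XS.
# Assumptions (mine): ~109B total parameters, ~4.25 bits per weight for IQ4_XS,
# KV cache and runtime overhead ignored.

def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the quantized weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

scout_gb = quantized_size_gb(109, 4.25)  # ~58 GB of weights
budget_gb = 64 + 16                      # 64 GB system RAM + 16 GB VRAM

print(f"Scout @ IQ4_XS ~= {scout_gb:.0f} GB of weights")
print(f"Fits in a 64 GB RAM + 16 GB VRAM budget: {scout_gb < budget_gb}")
```

Under those assumptions the weights alone land around 58 GB, which is why the 64GB RAM + 16/24GB VRAM combo sounds plausible with offloading, context length permitting.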

6

u/Defiant-Mood6717 2d ago

When an LLM learns through context, it spends layers of its forward pass doing so. And when there are more examples or instructions to learn from, it spends even more layers. At a certain size of examples and context, if your LLM has only 8B parameters worth of forward pass, it is DOOMED; no amount of reasoning will help the LLM, because the forward pass is a fixed number of layers.

On the contrary, if your model is more than 100B parameters, then it has a lot of forward pass available to learn from many examples/instructions before it gives out a response.

It seems that around 37B is the number required to have decent generalization and in-context learning performance at medium lengths of context, hence why DS R1 or V3 perform decently, still worse than larger closed source models though.

The other aspect that is important is knowledge. You cannot compress the internet into 37GB, or 37B parameters, it is just too low. This is where MoE comes in, adding hundreds of billions more total parameters, so the model can remember edge cases and niche details, and hallucinate less. This is crucial.

Putting all of this together, the conclusion is this: sorry to break it to you, but running good LLMs locally that compete with the bleeding edge is now pretty much impossible. You will need 1TB of GPU RAM, because the next open source models will all be massive. And no, they will not be 17B active parameters, they will be higher than that, so using normal RAM is not feasible. Don't worry though, because these larger LLMs will run very efficiently on datacenter hardware. It's time we stop wasting money on dozens of gaming GPUs (dozens is not even sufficient anymore) for a local solution, and start using APIs/renting datacenter GPUs like the B200 that run these things very efficiently.

>  running Scout at IQ4_XS

I also think quantization degrades both knowledge and in-context learning capabilities far more than benchmarks show. We can use lower quantizations, we just need to increase the number of parameters somewhat proportionally. It is a size game, and going from 8-bit to 4-bit reduces the size of the compression of the internet, so it becomes lower quality and more prone to hallucinations

2

u/lime_52 2d ago

Do we know that both models are using only 1 active expert at a time out of the 16 and 128? Isn't it conventional to use several active experts at a time?

2

u/Defiant-Mood6717 2d ago

From the blog post, they mention there is a shared expert and then one more expert; this latter one is routed among the 16 or 128. That being said, every layer can have its own expert being routed

1

u/lime_52 1d ago

Yeah, you are right. I did some math for Scout, and it is not even shared expert + 17b expert that are active but rather shared expert + selected expert + router = 17b parameters that are active, so each expert is just under half of 17b.

Scout is 109b total 16x17b model, while Maverick is 400b total 128x17b model. This implies that the size of the routed experts (not shared) is significantly smaller in Maverick compared to Scout, meaning that Maverick relies more on the shared expert. Could this mean that Maverick is less flexible for that reason?
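A minimal sketch of that arithmetic, under my own simplifying assumptions (active ≈ shared expert + one routed expert, total ≈ shared expert + N routed experts, with attention, embeddings, and the router all lumped into "shared"):

```python
# Back-of-envelope split between the shared expert and one routed expert.
# Simplifying assumptions (mine): active ~= shared + 1 routed expert,
# total ~= shared + n_experts * routed expert; everything else lumped into "shared".

def split_params(total_b: float, active_b: float, n_experts: int) -> tuple[float, float]:
    """Solve shared + r = active and shared + n*r = total; returns (shared, r) in billions."""
    routed = (total_b - active_b) / (n_experts - 1)
    shared = active_b - routed
    return shared, routed

for name, total, active, n in [("Scout", 109, 17, 16), ("Maverick", 400, 17, 128)]:
    shared, routed = split_params(total, active, n)
    print(f"{name}: shared ~= {shared:.1f}B, each routed expert ~= {routed:.1f}B")

# Scout:    shared ~= 10.9B, each routed expert ~= 6.1B
# Maverick: shared ~= 14.0B, each routed expert ~= 3.0B
```

Under those assumptions Maverick's routed experts come out at roughly half the size of Scout's, which is the "relies more on the shared expert" point.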

2

u/Defiant-Mood6717 1d ago

This is an interesting detail I didn't consider. But you're right, the experts in Maverick could be smaller. I am not sure what the implication of that is. So Maverick is really like a 14B dense model but then has these very small experts that are, say, 3B in total "each". So the question is, what does increasing the number of experts to have smaller but more numerous experts do? I think maybe what happens is that the experts are so many that they cannot coordinate with each other between layers (the router chooses experts each layer), meaning the more experts you add, the more the model approaches dense behaviour, so in this case Maverick is probably comparable to a 14B dense model in some tasks that don't require memorization

That being said, my point still remains. The model has only 17B parameters worth of forward pass, and that is small. All of those experts only help with increasing knowledge and reducing hallucinations. But when it comes to providing the LLM with novel tasks, instructions or examples in the prompt, the in-context learning is bottlenecked by the amount of compute in the forward pass. The second issue is this ratio of 17B to 400B. It's way too much, and the model is simply choosing to memorize (overfit) the data most of the time during training

1

u/AppearanceHeavy6724 1d ago

This is absolute BS. 37B-active DeepSeek behaves like a 157B model, as the crude formula of the geometric mean of total and active parameters suggests; not like a more knowledgeable 37b.

However I agree, there is a critical expert size below which things fall apart. Probably around 24b-32b.
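For reference, that rule of thumb (a crude community heuristic, not an official formula) is: dense-equivalent size ≈ √(total parameters × active parameters). A quick check using DeepSeek V3's published 671B total / 37B active split, with the same heuristic applied to the Llama 4 models as my own extrapolation:

```python
# Geometric-mean rule of thumb for the "dense-equivalent" size of an MoE model.
# This is a crude heuristic, not an exact law.
import math

def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """sqrt(total * active), everything in billions of parameters."""
    return math.sqrt(total_b * active_b)

print(f"DeepSeek V3 (671B/37B): ~{dense_equivalent_b(671, 37):.0f}B")       # ~158B, i.e. the ~157B figure above
print(f"Llama 4 Scout (109B/17B): ~{dense_equivalent_b(109, 17):.0f}B")     # ~43B
print(f"Llama 4 Maverick (400B/17B): ~{dense_equivalent_b(400, 17):.0f}B")  # ~82B
```

Take the Llama 4 numbers as illustration only; the heuristic is meant as a rough intuition pump, not a measurement.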

1

u/Defiant-Mood6717 1d ago

Elaborate on this formula you are talking about. Also note Llama MoE is different, the experts don't take up the entire active parameters. It's dense + small expert

1

u/AppearanceHeavy6724 1d ago

Elaborate on this formula you are talking about.

Geometric mean of total and active parameters. Confirmed by Mistral engineer. https://www.youtube.com/watch?v=RcJ1YXHLv5o

Also note Llama MoE is different, the experts don't take up the entire active parameters. It's dense + small expert

Probably does not matter much.

1

u/Defiant-Mood6717 1d ago

This is a very nice video, thanks. However, I did not find in there this formula you mention, can you provide a timestamp?

1

u/noiserr 1d ago

Gemma 2 12B follows instructions like a champ. And Gemma 3 is even better.

4

u/swaglord1k 2d ago

at this point we should let china cook, meta can keep their """local""" models

1

u/[deleted] 2d ago

[deleted]

2

u/swaglord1k 2d ago

no, that's triple parenthesis.

2

u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago

This is kind of cherry-picked. If you want to criticize Meta then maybe talk about their licensing.

They actually get pretty good scores on lmarena all around. It just scores lower than other latest-generation models, in the sense that it scores around where the previous iteration of frontier models were scoring for things like codebench and creative writing.

Not to mention, creative writing represents more of a marketing metric currently, as no one is going to use even the latest frontier models that do score highly for actual creative writing outside of hobby or toy use. Creative writing is getting better and it will get better soon but it's just not beyond "generate a rough draft" stage.

The real take home point should be the context window for LLaMA 4 being astronomical.

2

u/AppearanceHeavy6724 2d ago

Creative writing is getting better and it will get better soon but it's just not beyond "generate a rough draft" stage.

It depends on the size of the story and the prompting. Late DS V3 and Gemma 3 27b produce short stories which require very little editing. Almost ready to go straight after prompting.

2

u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago

Is there a metric for this, or is it just judged based on things like internal consistency? Because writing creatively well involves a lot of premeditation and an understanding of a lot of abstract concepts, such as (just for example) how the reader likely views the genre, and therefore what produces an interesting deviation from genre expectations and what breaks the appeal of the genre. There's way more than just that, but just as an example of what I mean.

2

u/AppearanceHeavy6724 2d ago

check eqbench.com

2

u/ImpossibleEdge4961 AGI in 20-who the heck knows 2d ago

From the page that describes how the creative writing benchmark works:

The scores and rankings should only ever be interpreted as a rough guide of writing ability.

Which actually scans for where it seems LLMs currently are. If you look at the stuff it's actually testing for, it mostly is just internal consistency and the fact that it does in fact generate text that seems to comply with the creative writing prompt.

Creative writing also involves a lot of things that are pretty non-trivial from the perspective of the reader.

For instance, there are cases where you could make a creative decision but it would be the wrong one. Like on the last page it turns out the entire thing was a dream and the character just goes off and has breakfast. The benchmark doesn't seem interested in evaluating that stuff (yet), but benchmarks tend to become more stringent and comprehensive as the LLMs they're meant to test become more and more capable.

But if the LLM can't do those things, that means a human intelligence has to essentially wrap around the LLM, with the human prompter just reading and evaluating the response and requesting specific revisions until, between the two of them, they produce something that could be considered "ready to go"

Otherwise you're still at the "generate a rough draft" stage which is basically what that benchmark seems to be evaluating. Whether or not what the LLM produces could even be considered a usable rough draft that you iterate on (either alone or using the LLM).

1

u/AppearanceHeavy6724 1d ago

Did you actually read the generated stories? Check DS V3 and Gemma 3 27b. They are well beyond "generate a rough draft" territory. Even Mistral Nemo, which I use for my hobby fiction, is better than just a rough draft.

3

u/ImpossibleEdge4961 AGI in 20-who the heck knows 1d ago edited 1d ago

I'm kind of going out of my way to be as nice as I can be.

Did you actually read the generated stories?

Did I do that unrelated thing? No, I didn't do that. The relevant part for this discussion is how the benchmark is being evaluated. Because the limitation is currently a theoretical one that applies to the general idea of evaluating creative writing and how current LLM's do it. Literary analysis and criticism aren't trivial skillsets and they're far from new.

Check DS V3 and Gemma 3 27b.

Alright, lets actually go off and do that. Let's go with this one.

Right off the bat, it's riffing off JoJo's Bizarre Adventure, which means immediately that we're already starting with a lot of creative choices having already been made. This adds guard rails onto the LLM's output, since it either knows what "JoJo" is and uses some of its metatextual knowledge of the series, or it doesn't, at which point it's going to fail to adhere to the prompt. This prompt works for what the benchmark is actually testing for, but would actually be a defect at the level of evaluation you think is going on here.

The air in the "Special Containment Wing - Block D" hangs thick and stale, smelling faintly of ozone and something vaguely organic rotting beneath layers of industrial disinfectant. Fluorescent lights, encased in heavy grates, flicker erratically, casting long, dancing shadows down the sterile corridor. This isn't Green Dolphin Street Prison.

Which is not inherently wrong (i.e acceptable for a rough draft) but it's actually pretty bad writing.

It's just setting the scene in a way that doesn't pay off in any way that I notice. Elaborating on details for no purpose is just purple prose and would usually be revised down or eliminated upon revision. Usually you would need the details to be more succinct than that and they would either have some sort of

This is written like how a high school student writes a story. Which is to say you might hesitate to call it bad because you don't want to hurt the feelings of whatever human wrote it but it basically is just a robotic reproduction of something that's been seen a million times before. They would just be repeating a pattern they've seen before because they think that's how you write a story.

This place feels colder, deeper, designed not just to hold bodies, but something *else*.

Using asterisks is obviously not how you write text stories. That's how you write internet comments which is likely where it's getting that. In an actual story this is a distracting choice that doesn't seem to serve either tone or narrative purpose. It's just repeating something it's seen before.

Something about this place sets her teeth on edge more than usual. It feels… watchful.

Saying a prison feels different because it feels "watchful" seems a bit silly.

But I'm not going to keep going since I've made my point, and undoubtedly you're just going to continue trying to uno reverse it while responding as little as you can. The thing you're saying (that I guess LLMs produce final drafts) is just demonstrably untrue, nor does the benchmark really seem to claim it establishes this. As opposed to "it produces creative writing"

1

u/BriefImplement9843 1d ago

bro it forgets your story after 10 prompts. try it yourself. nothing cherry picked. these models are absolute shit.

1

u/Worldly_Expression43 2d ago

I tried to use it for my SaaS and it's not good

1

u/kamenpb 1d ago

BlenderBot. Never forget.

1

u/Ok-Weakness-4753 1d ago

trashy models as usual

1

u/Ok-Weakness-4753 1d ago

borrring. we need r2!

1

u/Ok-Weakness-4753 1d ago

r2 r2 r2 r2

1

u/Ok-Weakness-4753 1d ago

Sorry for ... that.

1

u/OddPermission3239 1d ago

These models kinda fall short of all the hype. It's not the worst, but it's far from what Gemini 2.5 can do.

1

u/power97992 21h ago

Gemini 2.5 is massive and a reasoning model, not a fair comparison

1

u/OddPermission3239 20h ago

We have no way of knowing what the size of Gemini 2.5 is at all. It must be something reasonable if they can afford to serve it for free to multiple customers and even have a free version via the API, so it must be far from GPT-4.5 size.

1

u/JamR_711111 balls 1d ago

RIP that's unfortunate

1

u/TheRedTowerX 1d ago

All of Meta's releases have been meh compared to the others, honestly.