LLMs become much less useful the moment you dismantle bureaucracy.
The modern internet penalises anyone who thinks for more than 3 seconds before forming a strong opinion. You'll be happy to know that I only thought for 2 seconds before typing this.
The major barrier here is that the massive budgets get spent on AI researcher/engineer compensation and GPUs, leaving very little to pay for the best domain experts. I think this is a massive miscalculation. The first lab to realise this will quickly become the market leader.
It should therefore come as no surprise that new models like Grok 3 and GPT-4.5 feel like small incremental improvements. The focus now should be on improving post-training data. Frankly, the quality kinda sucks even at the best labs.
The idea that scaling up LLMs on human data will produce superhuman performance is magical thinking. The highest attainable performance in any given domain is simply the best human performance. Yes, maybe some insights transfer across domains, but I don't think there's actually evidence for that.
The people who laugh about hacky "vibe coding" with LLMs are the same people who think that Grok 3 is better than a doctor. Absolutely nuts.
"Below is the updated code" followed by absolutely no code is such an o3-mini thing to do that at this point I don't really understand why this model exists.
After much time spent looking at reasoning traces from DeepSeek R1 for medical cases, I have to conclude that there isn't a strong correlation between good reasoning and a good answer.
The really interesting thing is that they're not all made equal. The Llama-8B distill can solve medical cases that the Qwen-based R1 distills (and even R1 itself) cannot. World knowledge still matters for solving real problems, and no open models beat Llama in that regard.
Say the benchmark has 100 questions: generate 64 responses per question, and pass@1 is then the total number correct / 6400.
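That estimate can be sketched in a few lines of Python; the function name and inputs here are illustrative, not from any particular eval harness:

```python
def pass_at_1(correct_counts, k):
    """Estimate pass@1 by averaging over k samples per question.

    correct_counts: number of correct responses (out of k) for each question.
    """
    total_correct = sum(correct_counts)
    total_samples = len(correct_counts) * k
    return total_correct / total_samples

# 100 questions, 64 responses each; suppose every question
# gets 32 of its 64 samples right:
print(pass_at_1([32] * 100, 64))  # 3200 / 6400 = 0.5
```

Averaging over many samples per question just gives a lower-variance estimate of the same single-attempt accuracy.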
These R1 distilled models are absolutely amazing on a single turn, but truly horrible on multi-turn conversations.
"Mildly elevated rheumatoid factor has a very low positive predictive value that is completely overwhelmed in magnitude by the negative predictive value of not having any signs or symptoms of rheumatoid arthritis."
My 2 year old has 3 adjectives for the size of things. In increasing order: small, mummy, big.
5.7B tokens to solve 100 tasks??? I don't understand why we're thinking of this as being incredibly smart, when what this suggests is that it's incredibly dumb.
Releasing Jupyter Agents - LLMs running data analysis directly in a notebook!
The agent can load data, execute code, plot results, and follow your guidance and ideas!
A very natural way to collaborate with an LLM over data, and it's just scratching the surface of what will soon be possible!
Did you use oil?
What percentage of "rhupus" is just misdiagnosed Sjögren?
"ILD, hyperglobulinemia & lab abnormalities" sounds like SjD to me!
I just follow everyone and then spend all my time on the quiet posters feed (except when I want to come judge the yappers)
o1 is equal parts brilliant and boring. Very, very boring.
Llamafile is a cheat code.
"Zuckerberg's eyes brimmed with tears, and his heart felt full. He truly loved Big Brother!"
You better believe it. "California boomer" is the vibe I am getting from this guy.
Prop stethoscope checks out, though.
Sora's idea of a hand exam. This 60-year-old rheumatologist is giving me very 2nd-year medical student vibes with this bizarre technique. No synovitis was detected this day. #rheumsky
Oh, and if you follow him then you'll end up on a list, which means you'll see fewer of these people. I genuinely have no interest in seeing his posts (I find him uniquely annoying), but it's probably worth it.
Ultimately, if performance is anything like previous iterations of Phi, it will greatly underwhelm outside of benchmarks. So the license has no meaning to me.
So much delving