Integrating domain expert feedback into agents is the path to production, and MLflow is the way. Check out this end to end example on Databricks www.databricks.com/blog/self-op...
@wesleypasfield
Write at https://open.substack.com/pub/wesleypasfield | Previously Lark Health, AWS/Amazon, GoPro, Nielsen company | Currently Emerging Tech Fellow at US Census, and Adjunct Professor at University of San Diego Applied AI MS
Not to mention the domestic challenges from AI-driven inequality caused by our current incentive structure
Yeah I'd love to see the equivalent from ChatGPT. While this data is awesome, I think it's very important to emphasize this is adoption by field, not potential capabilities.
www.nytimes.com/2025/02/05/o...
This is such an amazing opinion piece! Time to pivot regulation to logical areas
Certainly not training compute, which is what current regulation specifies. The primary point is that the compute required for specific capabilities will be a moving target, as exemplified by the efficiencies DeepSeek showed recently
My NeurIPS paper on LLM regulation through data and evaluation is finally up on arXiv - I feel even stronger about this approach given the shift to reasoning and away from compute as a proxy for performance arxiv.org/abs/2502.03472
With all the DeepSeek news, thought I'd reshare my NeurIPS paper on data- and evaluation-based regulation as an alternative or complement to compute. The shift to reasoning makes this even more relevant: wesleypasfield.com/pasfield_neu...
Thank you for sharing this perspective broadly! This is very much in line with my paper at the NeurIPS RegML Workshop. Data + evaluation needs to be our focus: wesleypasfield.com/pasfield_neu... "Powering LLM Regulation through Data: Bridging the Gap from Compute Thresholds to Customer Experiences"
I keep hearing that AI will automate tasks and let people focus on higher leverage tasks… but clearly the intention is for AI to try to go up the chain to those higher leverage tasks too. I think we are hand-waving away what our intention is for humans in this agent-driven future
Like AGI, depends who you ask!
Has anyone seen an estimate of what will happen to human labor in knowledge work if agentic LLM solutions are successful? I've seen the market opportunity from the VC side (10x SaaS), but presumably that would come at the direct cost of human employment?
I don't have a great answer, but I think "test time" is an especially bad term because it gives the connotation that it is just for test sets (not live applications) - I think it's especially confusing for non-tech folks
It's easy to lead the witness - it's important to be as neutral as possible and ask for analysis/alternatives to ensure you are not just getting agreeable answers from the models
I've found the newsletter personally useful for keeping up with research at a greater breadth than before. I'm using Claude for the paper identification and summarization, and everything is serverless on AWS, so it's quite cheap. I hope folks enjoy!
I put together an automated newsletter - featured in Data Elixir and Data Science Weekly - that identifies interesting AI research and summarizes the content, sending out twice a week.
You can sign up here: wesleypasfield.com/aipapers/
And check out the code here: github.com/WesleyPasfie...
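For anyone curious how the summarization step could work, here is a minimal sketch. The function name and prompt wording are my own illustrations, not necessarily how the linked repo does it; the commented-out call shows where the Anthropic messages API would be invoked:

```python
# Hypothetical sketch of the newsletter's Claude summarization step.
# The real pipeline lives in the linked GitHub repo and may differ.

def build_summary_prompt(title: str, abstract: str) -> str:
    """Build a prompt asking Claude for a short newsletter blurb."""
    return (
        "You are writing a twice-weekly AI research newsletter.\n"
        f"Paper title: {title}\n"
        f"Abstract: {abstract}\n"
        "Summarize the key contribution in 2-3 sentences for a technical "
        "audience, and note why it is interesting."
    )

# The prompt would then be sent to Claude, e.g. with the Anthropic SDK:
#   client = anthropic.Anthropic()
#   message = client.messages.create(
#       model="claude-3-5-sonnet-20241022",
#       max_tokens=300,
#       messages=[{"role": "user", "content": prompt}],
#   )

prompt = build_summary_prompt(
    "Example Paper Title",
    "We study an example problem and report example results.",
)
print(prompt)
```

Keeping the prompt construction separate from the API call makes the pipeline easy to unit test without network access, which fits a cheap serverless setup.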
Happy New Year! To kick off the year, I've finally been able to format and upload the draft of my AI Research Highlights of 2024 article.
It covers a variety of topics, from mixture-of-experts models to new LLM scaling laws for precision:
Easy prediction for 2025 is that the gains in AI model capability will continue to grow much faster than (a) the vast majority of people's understanding of what AI can do & (b) organizations' ability to absorb the pace of change. Social change is much slower than technological change.
I've been trying to write more on Substack - I just published a post on how some of the more recent LLM trends (agents, test-time inference) could impact society moving forward, and what we should do about it:
open.substack.com/pub/wesleypa...
LLM hallucinations are a feature AND a bug
"This is not just incremental progress; it is new territory, and it demands serious scientific attention." As inference enhancements drive LLM performance optimizations, regulatory efforts that don't directly measure model output will be less and less relevant arcprize.org/blog/oai-o3-...
I'll be at the RegML workshop tomorrow at NeurIPS in East Meeting Room 13 - come say hi!
I just updated the NeurIPS starter pack with many more attendees
Let me know if you'd like to be added
go.bsky.app/BuJXg5q
#NeurIPS2024 #NeurIPS
Strongly agree with the primary takeaway from this argument: "You are getting left behind if you do not adopt chat-based programming as your primary modality." Not sure if chat will always be the primary medium but LLM assisted/driven development is here to stay sourcegraph.com/blog/the-dea...
Paper contents around using data for domain specific evaluation to enable more logical LLM regulation
I think this extends to regulatory efforts as well. Better benchmarks / means of evaluation will lead to more logical regulation. That is the core principle of the paper I will present at NeurIPS later this week
They say the ideal use case is "narrow sets of complex tasks led by experts" - thoughts on whether that means a singular outcome with a lot of complicated steps, or perhaps a wider set of outcomes but a very defined problem space (or something else)? Having a hard time interpreting that on the surface
I think it's about the ways humans are wrong vs. AI-based systems rather than accountability. We can trace back why the human made the decision, whether they are accountable or not; AI being wrong feels much more random
Number 4 is often the real challenge for an enterprise-specific problem, as it's not easy to measure the output of LLMs given their generalized nature, or the human experts who would serve as the comparison are the ones doing the evaluation
Very happy to see this release from AWS. This makes Bedrock very compelling and the type of practical offering that can help LLM based experiences exit the experimental phase into prod for a specific domain/application aws.amazon.com/about-aws/wh...
Have to remind yourself it's a token predictor trained on data that very infrequently includes "I don't know." I often say "it's ok to say I don't know if you're uncertain," which anecdotally seems to mitigate this a bit, but I'm not sure how much impact it has or whether it negatively affects the overall response
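As a concrete illustration of that prompting habit, here is a small sketch. The helper name and message structure are illustrative (a generic chat-message list, not any specific library's API); the instruction text mirrors what I say to the model:

```python
# Sketch of nudging a model toward admitting uncertainty via a
# system message. Helper name and structure are hypothetical.

UNCERTAINTY_INSTRUCTION = (
    "It's ok to say 'I don't know' if you're uncertain. "
    "Do not guess; state your uncertainty explicitly."
)

def with_uncertainty_guard(question: str) -> list:
    """Wrap a user question with a system message permitting 'I don't know'."""
    return [
        {"role": "system", "content": UNCERTAINTY_INSTRUCTION},
        {"role": "user", "content": question},
    ]

messages = with_uncertainty_guard("What did the CEO say in yesterday's meeting?")
```

Putting the instruction in a reusable system message, rather than retyping it per question, keeps the nudge consistent across an application's prompts.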