i'd be more interested in tokens/s numbers on the longer-horizon benchmarks, especially ones using first-party tool calling. even wall-clock time would be cool to see
i feel like only anthropic can claim this as a true success, one leak of these eval awareness concepts into training data and now any model can pass the eval awareness benchmark
actually so true, i can only stop once i know things will require me to actually look hard and fix them, and i'm not staying up to watch claude fail for 20 minutes straight and make no progress
hey! i was browsing randomly and happen to be traveling in kyoto, i have some free time during the day tomorrow so thought i might as well ask!
i took a hand building class in NYC and loved it, visited Kawai Kanjiro's house and loved it, lmk if you have some time, would love to learn!