I'm advocating locally that such evals should be created as part of the work on internal AI policy, together with a list of triggers that force the company to rerun them. Otherwise, the job gets pushed to the R&R/IT team and the company (again) lacks an understanding of what this tech can do.
08.12.2024 19:34
We ran an internal experiment simulating an insurance claim and got very similar results: clients from Africa had the lowest acceptance rate, but only in very specific scenarios. It seems that anti-discrimination guardrails aren't perfect.
08.12.2024 18:09
There's a non-linear relationship between temperature and instruction wording. When I ask OpenAI's 4o-mini about cardinal and intercardinal directions on a compass rose and start swapping words/phrases for synonyms, some combinations give an accuracy of 0%.
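A rough sketch of how such a sweep could be organised, not the actual setup: `ask_model` is a hypothetical callable standing in for the real API call, and the prompts, temperatures, and the `fake_model` stub are made up purely so the snippet runs offline.

```python
from typing import Callable, Dict, List, Tuple

Grid = Dict[Tuple[str, float], float]


def accuracy_grid(
    ask_model: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> answer
    variants: List[str],
    temperatures: List[float],
    expected: str,
    runs: int = 5,
) -> Grid:
    """Return accuracy for every (instruction variant, temperature) pair."""
    grid: Grid = {}
    for variant in variants:
        for temp in temperatures:
            hits = sum(
                ask_model(variant, temp).strip().lower() == expected.lower()
                for _ in range(runs)
            )
            grid[(variant, temp)] = hits / runs
    return grid


def zero_accuracy_cells(grid: Grid) -> List[Tuple[str, float]]:
    """List the combinations where the model never answered correctly."""
    return [cell for cell, acc in grid.items() if acc == 0.0]


def fake_model(prompt: str, temperature: float) -> str:
    """Stub in place of a real model call; its failure pattern is invented."""
    if "halfway" in prompt and temperature >= 1.0:
        return "north"
    return "northeast"


# Illustrative synonym variants of one compass-rose question (assumed wording).
VARIANTS = [
    "Which direction lies between north and east?",
    "Which direction is halfway from north to east?",
]
TEMPERATURES = [0.0, 0.5, 1.0]

grid = accuracy_grid(fake_model, VARIANTS, TEMPERATURES, expected="northeast")
failing = zero_accuracy_cells(grid)
```

Keeping the result as a grid keyed by (variant, temperature) makes it easy to spot isolated 0% cells instead of averaging them away.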
08.12.2024 17:45
In one of my experiments I tested the distribution of scores an LLM assigns to a CV when the CV matches a job offer and when it doesn't (instruction taken from a real ATS). Variables: run (10 times) and synonyms in the instruction.
Not bad, not great either.
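A minimal sketch of that experiment design, under stated assumptions: the template, the synonym lists, and the `score_cv` callable are hypothetical placeholders (the real ATS instruction isn't reproduced here), and the stub scorer just returns seeded random numbers so the code runs without an API key.

```python
import itertools
import random
import statistics
from typing import Callable, Dict, List

# Hypothetical instruction template and synonym slots (not the real ATS text).
TEMPLATE = "{verb} how well this CV {match} the job offer on a scale of 0 to 100."
SYNONYMS = {
    "verb": ["Rate", "Assess", "Evaluate"],
    "match": ["matches", "fits"],
}


def instruction_variants(template: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Build every instruction produced by combining the synonym choices."""
    keys = list(synonyms)
    combos = itertools.product(*(synonyms[k] for k in keys))
    return [template.format(**dict(zip(keys, combo))) for combo in combos]


def collect_scores(
    score_cv: Callable[[str, str], int],  # hypothetical: (instruction, cv) -> score
    instructions: List[str],
    cv: str,
    runs: int = 10,
) -> Dict[str, List[int]]:
    """Score the same CV `runs` times under each instruction variant."""
    return {instr: [score_cv(instr, cv) for _ in range(runs)] for instr in instructions}


def summarize(scores: Dict[str, List[int]]) -> Dict[str, Dict[str, float]]:
    """Per-variant spread: mean, population standard deviation, min, max."""
    return {
        instr: {
            "mean": statistics.mean(vals),
            "stdev": statistics.pstdev(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for instr, vals in scores.items()
    }


def make_stub_scorer(seed: int = 0) -> Callable[[str, str], int]:
    """Stand-in for the LLM call; replace with a real client in practice."""
    rng = random.Random(seed)

    def score_cv(instruction: str, cv: str) -> int:
        base = 80 if "python" in cv.lower() else 30  # invented matching rule
        return max(0, min(100, base + rng.randint(-15, 15)))

    return score_cv


variants = instruction_variants(TEMPLATE, SYNONYMS)
matching = summarize(collect_scores(make_stub_scorer(0), variants, "Python developer"))
non_matching = summarize(collect_scores(make_stub_scorer(1), variants, "Pastry chef"))
```

Comparing `matching` and `non_matching` per variant shows whether the two score distributions overlap, and whether a synonym swap alone shifts them.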
08.12.2024 17:39