omg everybody go draw a horse this is what the internet was made for
gradient.horse
Pretty shameful, Ms. Anand!
congrats! last time we caught up you were i think just acquiring a much smaller electric boat... cool to hear you've been Scaling Up. is the cat in the water already?
afaict you either need to argue that i've infringed by producing a copy of an article that i've never seen; or that the model creator infringed, and the model does "contain" a copy of the article in some sense, even though the model is definitely not "just" a copy of those inputs...
suppose: the nyt example was for an open-weights model like llama; i get the model and recover an nyt article from it, like they demonstrated in court. i now have an illegal copy; where's the copy from?
sure, happy to leave it here, and ultimately this is something a judge will decide as you say! but i will drop a last thought at the end here anyways since i already typed it up...
sorry if i'm being pedantic! but this kind of hair-split is the sort of thing the law cares about, and i think the article is a little fuzzy on...
is it? in a section with a summary like "it's still critical that training not involve copying", it seems relevant that quite a bit of copying happens in practice, and that it's hard to prevent.
for sure, but "my system only copies a small percentage of (the ~entire internet)" and "i wish my system did not copy data so often" are not arguments that copying isn't happening...
good news! they shared the prompts: nytco-assets.nytimes.com/2023/12/Laws...
and if they do it in public it can be copyright infringement!
and i don't find the article's treatment of this super convincing: it agrees that all models do this, says it's bad, and then ignores it in the conclusions...
i was happy to see you share this article; i think it's more right than most things written on this topic! but eg. when the nyt can get a model to spit out its articles nearly word for word, i think there's a pretty clear argument that a copy has been made and distributed...
for ~all common models, it's quite easy to get an llm to spit out portions of its training data verbatim... hard to argue that distributing those models is not distributing that data in a legal sense!
thanks for sharing! read the vulnerability report from citizenlab... looks like the issue was in the keyboard, and citizenlab still recommends using signal. (with all the security settings turned on!)
oh hey congrats! i remember you were taking another swing at this - glad to see it over the line