For pre-training data, this thread has good paper recommendations
bsky.app/profile/mari...
Not academic work, but for evals and data, these survey articles are quite in-depth with links to papers.
LLM judge survey: eugeneyan.com/writing/llm-...
Synthetic pre-training/post-training survey: eugeneyan.com/writing/synt...
I watched a talk by @thomwolf.bsky.social of @hf.co; they use trafilatura (trafilatura.readthedocs.io) for the HTML-to-text conversion in their library datatrove (github.com/huggingface/...).
The talk focuses more on filtering, but here it is:
www.youtube.com/watch?v=2-SP...
Wrote a package for the gpu-poor/mac-poor to run ollama via remote servers (paid, colab, kaggle etc.)
Just 2 lines and your local ollama can run all models on server-side GPU:
> pip install ollama-remote
> ollama-remote
github.com/amitness/oll...
Wrote up a deep dive on how @tool decorators in various agent frameworks use Python runtime introspection for function-to-JSON-schema conversion.
amitness.com/posts/functi...
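A minimal sketch of the idea, not the code from the post or from any specific framework: `inspect.signature` and `typing.get_type_hints` are enough to turn a plain function into a tool-style JSON schema. The `tool` decorator, `function_to_schema`, and `get_weather` are hypothetical names for illustration, and only a few primitive types are mapped.

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_schema(func):
    """Build a tool-style JSON schema from a function's signature and docstring."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unsupported types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def tool(func):
    """Hypothetical decorator: attach the generated schema to the function."""
    func.schema = function_to_schema(func)
    return func


@tool
def get_weather(city: str, days: int = 1) -> str:
    """Return a short forecast for a city."""
    return f"{city}: sunny for {days} day(s)"
```

After decoration, `get_weather.schema` holds the dict you would pass to an LLM API as a tool definition, while the function itself stays callable as usual.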
Sure, SFT is simulating annotators from those countries
But you see this repeatedly on reddit/linkedin, where people downvote a comment and call it out as "sounds like chatgpt" because it contains a phrase or syntax from the antislop list
Not accurate, as you pointed out, but that's what a layperson uses as a proxy
Picking a few keywords from this antislop list:
github.com/sam-paech/an...
You actually don't need multiple --with. A comma separated list of packages also works (though looks a bit uglier)
uvx --with llm,sqlite-utils ipython
Wrote a literature review on various automated evals for measuring linguistic diversity in LLM generated synthetic data.
Useful to systematically test impact of various techniques on diversity
amitness.com/posts/divers...
You can do it with skyfeed + running your custom logic on github actions
bsky.app/profile/amit...
Same energy (h/t @hamel.bsky.social )
x.com/HamelHusain/...
I just rely on these:
- alphasignal for daily updates
- email subs to blogs (eugeneyan, simonw, hamel, jasonliu)
- read O'Reilly for bird's-eye surveys (chip huyen's ai eng, jay's hands-on llm etc.)
- deeplearning.ai "short" courses to know what's out there (topics I don't touch at work e.g. agents)
how are you tackling the last 2 points?
That's super cool, I'll give it a try and thank you for building Skyfeed!
cc: @pfrazee.com
@simonwillison.net (another git scraping avenue)
Wrote down the process to build your own custom feeds for Bluesky programmatically in Python and run it 100% free
Uses @skyfeed.app + @github.com actions to do periodic filtering and re-ranking and @cloudflare.social static pages to provide data to @bsky.app
Would this be stance detection? A controversial post would have a high entropy of stance distribution in replies/quotes aka the "1M posts" drama.
paperswithcode.com/task/stance-...
Mutes on a post might also be a good proxy for downvotes, but those are private and can't be accessed via the API.
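A rough sketch of the entropy idea, assuming each reply/quote has already been assigned a stance label by some classifier (not shown). The label names `"favor"` and `"against"` are placeholders.

```python
import math
from collections import Counter


def stance_entropy(stances):
    """Shannon entropy (in bits) of the stance labels on a post's replies/quotes."""
    counts = Counter(stances)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up label lists: a mostly one-sided thread vs. an evenly split one.
one_sided = ["favor"] * 9 + ["against"]
split = ["favor"] * 5 + ["against"] * 5
```

An evenly split thread reaches the maximum entropy for two labels (1 bit), while a one-sided thread scores lower, so a threshold on this value could flag controversial posts.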
It's also why the feed loads super fast. Bluesky is simply making a request to this static endpoint on cloudflare when you open the feed and just fetches the JSON for the post ids and loads that into their UI.
bluesky-1tj.pages.dev/xrpc/app.bsk...
Thanks; the trick is how the bluesky protocol operates. It makes GET requests to 3 endpoints and expects JSON
So, instead of running a server 24/7, you can offload indexing to @skyfeed.app, periodically filter the feed via github actions and just dump that into cloudflare pages with correct paths
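A sketch of what "dump that into cloudflare pages with correct paths" could look like. The three paths and payload shapes below follow my understanding of the public Bluesky feed-generator conventions and are an assumption, not taken from the repo; `write_static_feed` and all hostnames/URIs are hypothetical.

```python
import json
from pathlib import Path


def write_static_feed(out_dir, post_uris, hostname, feed_uri):
    """Write the three JSON documents a feed generator serves, as static files."""
    out = Path(out_dir)
    service_did = f"did:web:{hostname}"
    files = {
        # Identifies the service and where it is hosted.
        ".well-known/did.json": {
            "@context": ["https://www.w3.org/ns/did/v1"],
            "id": service_did,
            "service": [{
                "id": "#bsky_fg",
                "type": "BskyFeedGenerator",
                "serviceEndpoint": f"https://{hostname}",
            }],
        },
        # Lists the feeds this service provides.
        "xrpc/app.bsky.feed.describeFeedGenerator": {
            "did": service_did,
            "feeds": [{"uri": feed_uri}],
        },
        # The feed itself: an ordered list of post URIs.
        "xrpc/app.bsky.feed.getFeedSkeleton": {
            "feed": [{"post": uri} for uri in post_uris],
        },
    }
    for rel_path, payload in files.items():
        target = out / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload))
    return sorted(files)
```

A scheduled job would call this with the filtered post URIs and publish the output directory as the static site, so nothing has to stay running between updates.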
I fetch the feed created by skyfeed using the bluesky sdk and, for posts with arxiv links, use the pyarxiv library to fetch the category and filter items to these categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MA
Here is the relevant code
The filtering runs every 30m for free via github actions
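A simplified sketch of the category filter, not the actual code from the repo. The metadata lookup is passed in as a function so no specific arXiv client API is assumed; `keep_post`, `lookup`, and the ids in `fake_index` are made up for illustration.

```python
import re

ALLOWED = {"cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA"}
# Matches new-style arXiv ids such as 2401.01234 in abs/ or pdf/ links.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


def keep_post(text, lookup_categories, allowed=ALLOWED):
    """Keep a post if any arXiv paper it links to falls in an allowed category."""
    for arxiv_id in ARXIV_ID.findall(text):
        if allowed & set(lookup_categories(arxiv_id)):
            return True
    return False


# Stand-in for a real metadata lookup (e.g. via an arXiv client library).
fake_index = {"2401.00001": ["cs.CL", "cs.LG"], "2401.00002": ["math.AG"]}


def lookup(arxiv_id):
    return fake_index.get(arxiv_id, [])
```

Posts without an arXiv link, or linking only to papers outside the allowed categories, are dropped.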
Hey @mariaa.bsky.social, I got it working. Here you go
bsky.app/profile/amit...
The most interesting part is the filtering and ranking; you can do a bunch of stuff. I went with hackernews ranking as a start to balance recency vs popularity.
You could even train your own classifiers to make it more personalized; bluesky seems super hackable, love it!
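For reference, a sketch of the commonly cited Hacker News-style score, where popularity is divided by a power of the post's age. Using raw like counts as "points" and a gravity of 1.8 are assumptions here, not necessarily what the feed uses; the post data is made up.

```python
from datetime import datetime, timedelta, timezone


def hn_score(likes, created_at, now, gravity=1.8):
    """Hacker News-style score: popularity decayed by age in hours."""
    age_hours = (now - created_at).total_seconds() / 3600
    return likes / (age_hours + 2) ** gravity


now = datetime(2024, 12, 1, 12, 0, tzinfo=timezone.utc)
posts = [
    {"id": "old_popular", "likes": 200, "created_at": now - timedelta(hours=48)},
    {"id": "fresh", "likes": 20, "created_at": now - timedelta(hours=1)},
]
# Highest score first: a fresh post can outrank an older, more-liked one.
ranked = sorted(
    posts,
    key=lambda p: hn_score(p["likes"], p["created_at"], now),
    reverse=True,
)
```

Raising `gravity` makes older posts sink faster; lowering it lets popular posts stay on top longer.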
Built a custom feed that shows latest arxiv+acl papers that belong to AI/ML/NLP/Computer vision categories. No bots/random papers belonging to other fields now.
bsky.app/profile/amit...
Generated in python but runs 100% free without a server; I'll do a write-up soon
github.com/amitness/blu...
Bookmark labeler/feed! Unlike the pin feed where you have to comment 📌 to save a post, this alternative is completely private!
Subscribe to both the labeler and feed here. Report a post to the labeler to add it to the feed! Report again to remove it!
bsky.app/profile/book...
I am also planning to hack around on an ML-powered feed this weekend. Goal is to see if it can be hosted for free (thinking cloudflare workers + kv cache free tier)
Also saw this earlier, could be useful to join
bsky.app/profile/dani...
Want to use bluesky replies as your blog's comment section?
`npm install bluesky-comments`
Built by @coryzue.com
If using OpenAI, you need to make sure the "reasoning" key is before the "answer" fields (i.e. ResponseFormatA if using pydantic)
@dylancastillo.co has done a nice analysis of this, and the order matters
dylancastillo.co/posts/llm-py...
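To illustrate the ordering point without extra dependencies, here is a stdlib sketch that uses dataclasses in place of pydantic (pydantic models likewise keep fields in declaration order). `ResponseFormatB` and `to_json_schema` are hypothetical names added for contrast; only `ResponseFormatA` is mentioned above.

```python
from dataclasses import dataclass, fields


@dataclass
class ResponseFormatA:
    reasoning: str  # declared first, so it appears first in the schema
    answer: str


@dataclass
class ResponseFormatB:
    answer: str  # declared first: the answer would come before any reasoning
    reasoning: str


def to_json_schema(cls):
    """Build a JSON schema whose property order follows field declaration order."""
    names = [f.name for f in fields(cls)]
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in names},
        "required": names,
        "additionalProperties": False,
    }


schema_a = to_json_schema(ResponseFormatA)
schema_b = to_json_schema(ResponseFormatB)
```

Since the model generates tokens left to right, a schema shaped like `schema_a` gives it room to write the reasoning before committing to an answer, which is the layout recommended above.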