cross-posted from: https://lemmy.world/post/19242887

I can run the full 131K context with a 3.75bpw quantization, and still a very long one at 4bpw. It should just barely be fine-tunable in Unsloth as well.
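As a rough sanity check on those quantization figures, the weight footprint scales linearly with bits-per-weight: parameters × bpw ÷ 8 gives bytes. A minimal sketch, assuming a hypothetical 32B-parameter model (the actual parameter count isn't stated above):

```python
def quantized_weight_gib(n_params_billion: float, bpw: float) -> float:
    """Approximate on-disk/VRAM size of quantized weights.

    n_params_billion: parameter count in billions (hypothetical value below).
    bpw: bits per weight of the quantization (e.g. 3.75 or 4.0).
    """
    total_bits = n_params_billion * 1e9 * bpw
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

# Hypothetical 32B model at the two quantizations mentioned:
for bpw in (3.75, 4.0):
    print(f"{bpw} bpw -> {quantized_weight_gib(32, bpw):.1f} GiB")
```

The quarter-bit difference between 3.75bpw and 4bpw frees roughly a gigabyte at this scale, which is what gets traded for the extra context.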

It’s pretty much perfect! Unlike the last iteration, they’re using very aggressive GQA, which keeps the context’s memory footprint small, and it feels really smart at long-context work like storytelling, RAG, and document analysis (whereas Gemma 27B and Mistral Code 22B are probably better suited to short chats/code).
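The reason GQA shrinks the context footprint: the KV cache stores one key and one value vector per KV head, per layer, per token, so cutting the number of KV heads cuts cache size proportionally. A minimal sketch with hypothetical layer/head counts (not the actual config of the model above):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 2**30

# Hypothetical config: 40 layers, head_dim 128, 131072-token context, FP16 cache.
full_mha = kv_cache_gib(40, 40, 128, 131072)  # 40 KV heads (no GQA)
gqa      = kv_cache_gib(40, 8, 128, 131072)   # 8 KV heads (aggressive GQA)
print(f"MHA: {full_mha:.0f} GiB, GQA: {gqa:.0f} GiB")
```

With these assumed numbers, going from 40 KV heads to 8 drops the 131K-token cache from 100 GiB to 20 GiB, a 5x saving, which is what makes full-length context feasible on a single GPU (quantizing the cache itself shrinks it further).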

  • brucethemoose@lemmy.world (OP) · 1 day ago

    Oh yeah, I was thinking of free APIs. If you’re looking for paid APIs, DeepSeek and Cohere are of course great. Gemini Pro is really good too, and free for 50 requests a day. The Cerebras API is insanely fast, way beyond anything else. Check out OpenRouter too; they host tons of models.