brian · 22d

Claude dropping subscription access to agents is a win. So much compute is being wasted by meandering calls, and it means the AI companies are compute-constrained and are recognizing it.

Honestly, agents shouldn't run on enterprise models. You should build the modules for your agent with top models but execute the system with local-ish 30-80B models. Qwen3 80B and qwen3.5-35B are both usable at 128 GB VRAM.

Most people use agents for repeating tasks anyway, so it's better to concretize those tasks into modules you can call again and again. My personal feeling is that common users will be boxed out of compute sometime in 2027-28 as corporate use scales faster than energy production, so adapting to a local focus is critical to building something you can actually count on. DGX Spark is interesting, but it is speed-constrained by memory bandwidth and by current NVFP4 compatibility with vllm and sglang. If and when NVFP4 runs well on DGX Spark, that will be a local breakthrough; then I think we will see 80 tokens/s for agentic tasks. TQ3 context quant could also bring incremental improvement, given the memory-bandwidth issue.
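To make the "modules" point concrete, here is a minimal sketch, assuming a local vLLM or SGLang server exposing its OpenAI-compatible API on localhost:8000; the model name, port, and prompt are placeholders for illustration, not anything the post prescribes:

    # A repeating agent task concretized as a module, executed on a local model.
    # Assumes a vLLM/SGLang server with an OpenAI-compatible endpoint at localhost:8000.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    # The fixed, tested prompt for one repeating task, so the agent loop
    # never has to rediscover it with meandering calls to a frontier model.
    SUMMARIZE_DIFF_PROMPT = (
        "Summarize the following git diff in three bullet points, "
        "flagging any change that touches error handling:\n\n{diff}"
    )

    def summarize_diff(diff: str) -> str:
        """One concrete, repeatable task run on the local 30-80B model."""
        response = client.chat.completions.create(
            model="local-model",  # whatever name the local server registers
            messages=[{"role": "user",
                       "content": SUMMARIZE_DIFF_PROMPT.format(diff=diff)}],
            temperature=0.2,
        )
        return response.choices[0].message.content

Calling summarize_diff(diff_text) again and again is the repeatable part: the top model is only needed up front, to design and test the module, while the local model does the execution.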
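On the memory-bandwidth point, a rough way to see why decode speed is capped is a back-of-envelope bound: tokens/s is roughly memory bandwidth divided by the bytes of weights read per token. The numbers below are illustrative assumptions, not measured DGX Spark specs:

    # Back-of-envelope decode throughput for a bandwidth-bound box.
    # All numbers are assumptions for illustration, not measurements.
    bandwidth_bytes_s = 273e9   # assumed unified-memory bandwidth, ~273 GB/s
    active_params = 3e9         # assumed active parameters per token (MoE)
    bytes_per_param = 0.5       # ~4-bit weights, e.g. NVFP4

    tokens_per_s = bandwidth_bytes_s / (active_params * bytes_per_param)
    print(f"~{tokens_per_s:.0f} tokens/s upper bound (weights traffic only)")

This bound ignores KV-cache and activation traffic, which is exactly why a tighter context quant like the TQ3 idea above would help: less data moved per token over the same bandwidth.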