Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The WildStandardized web-agent evaluation improves reliability and comparability.
cs.AI updates on arXiv.org3h100
100A
It shows web-agent benchmarks are too ambiguous to compare fairly, then proposes a standardized evaluation protocol that cuts measurement noise.
Task framing ambiguity and run-to-run variability break fair comparisons.
Show HN: SwiftLM – Qwen Chat on iPhone, 100B+ Moe on M5 Pro 64GB (Native Swift)Native Swift Metal server compresses KV caches fast.
Hacker News Show HN1h100
100A
SwiftLM is a native Swift/Metal inference server that runs huge MLX models with an OpenAI-style API, using TurboQuant and SSD expert streaming to fit 122B MoE on Apple Silicon.
OpenAI-compatible /v1/chat/completions served by a single Swift binary.
Hybrid TurboQuant uses Metal-fused dequant to keep quality high.
Experimental SSD “expert streaming” targets MoE matrices for 122B scale.
Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific ResearchAdaptive multi-agent science beats static workflows.
cs.AI updates on arXiv.org3h98
98A
Mimosa is an open-source evolving multi-agent system that auto-builds and refines scientific workflows using dynamic tools and feedback-driven iteration.
Mimosa replaces fixed ASR pipelines with synthesized workflow topologies.
It uses MCP for dynamic tool discovery during scientific runs.
A meta-orchestrator decomposes tasks; code agents execute subtasks.
The paper finds LLMs can draft clinically convincing prior-authorization letters, but they systematically miss payer-required administrative fields that real workflows demand.
Three LLMs produced clinically accurate, well-argued letters across scenarios
Clinical scoring missed recurring payer-form requirements like billing codes
Models often omitted authorization duration requests and follow-up plans
What the invasion of Ukraine and Iran war should teach Europe about air defenceAir defence fails when attackers outproduce interceptors.
euronews.com1h93
93A
Bruegel argues Europe can’t defend forever against cheap mass drones and missiles; it must adopt low-cost counter-drone interception and expand deep strikes to hit Russian production.
Interception economics break when attackers field drones cheaper than interceptors
Europe’s bigger threat is Russia’s mature air force and integrated air defences
Ukraine’s experience shows firing scarce interceptors vs letting strikes through
John Poole shows Intel’s BOT can materially change Geekbench 6.3 (and some workloads) by vectorizing code, but not 6.7 much—making those scores less comparable.
BOT adds a checksum-based startup overhead on Geekbench 6.3/6.7
Geekbench 6.3 scores rise ~5.5% with BOT, while 6.7 stays near-flat
No helium, no chips: why Australia needs to make the gas againAustralia should restart helium recovery for chips.
ASPI Strategist2h91
91A
Australia’s helium supply is being treated like a rounding error, but chip fabs can’t run without it, so the country should restart recovery and lock in allied demand.
Helium is a critical input for EUV lithography contamination control.
Supply shocks to Qatar can remove ~30% of global helium overnight.
Australia’s LNG plants likely vent extractable helium due to misaligned incentives.
The Third Islamic RepublicIran’s war strategy likely strengthens its regime.
Foreign Affairs Magazine3h91
91A
Maloney argues the Iran-Israel war unintentionally strengthens Iran’s security-first succession plan while turning the Strait of Hormuz into leverage that reshapes regional order.
Iran’s escalation ladder shifts from limited strikes to Hormuz chokehold.
Decentralized defense lets Tehran dictate tempo despite leadership decapitation attempts.
Mojtaba’s succession opportunity strengthens Guards dominance over governance.
[AINews] The Claude Code Source LeakA leak revealed coding-agent orchestration mechanisms.
Latent Space37m90
90A
The Claude Code source leak unintentionally exposed how coding agents do state, memory, subagents, caching, and tool orchestration—immediately sparking forks and even supply-chain tricks.
Leaked code spotlights agent orchestration, not model weights
Financial groups lay out a plan to fight AI identity attacksAI makes identity attacks scalable; crypto-backed identity resists.
Help Net Security2h89
89A
A banking industry coalition argues AI has made identity theft cheaper and faster, and urges government to modernize credentials using cryptography, e-government verification, and phishing-resistant authentication.
Deepfake-driven identity attacks rose sharply, driven by cheaper AI generation.
Phishing costs collapsed with LLM automation, improving attacker success rates.
Cryptographic credentials tied to private keys resist AI possession spoofing.
Invisible plumes and ‘terrible pollution’: the reality of the US gas sites rated ‘grade A’Voluntary methane certification can’t reliably verify reality.
guardian.co.uk1h88
88A
An investigation finds MiQ’s “certified” methane grades rely on operator inventories and limited on-site checks, while field footage shows major leaks and flaring problems.
MiQ audits largely verify operator inventories, not direct emissions measurement.
Field optical imaging at Permian sites shows visible leaks and broken flares.
Satellite/aerial studies estimate Permian methane far above EPA and MiQ-style baselines.
A Post-American Persian Gulf?Iran shocks accelerate Gulf energy diversification away from America.
Foreign Affairs Magazine3h88
88A
A widening Iran-war energy shock is speeding up Gulf states’ shift from “oil for security” toward investor-led energy diversification, especially via China.
Strait of Hormuz traffic collapse delays recovery despite possible cease-fire
Ras Laffan damage may cut LNG capacity for years, raising price-volatility risks
Gulf states are moving up the value chain via refining, storage, renewables, and petrochemicals
America Is Losing the Innovation RaceUS cuts basic science while China scales innovation.
Foreign Affairs Magazine3h85
85A
The article argues the U.S. is losing because it weakened basic science funding and commercialization pipelines, while China is scaling the whole innovation stack.
China is scaling innovation from basic research to production systematically
US policy shifts are disrupting university research capacity and peer-review merit
Talent flows respond to funding stability and immigration enforcement uncertainty
Aristocracy and Hostage CapitalAristocrats used hostage capital to buy trustworthy governance.
Hacker News (4+ points)3h85
85A
The piece argues pre-modern aristocracies worked because nobles “posted” credibility bonds, making dishonesty costly when performance was hard to measure.
Monitoring was weak, so trust was enforced with costly bonds
Dishonesty triggered loss of status and royal access
Many aristocratic “irrationalities” increase the cost of cheating
The Self-Cancelling SubscriptionA sync/async unlink race cancels subscriptions minutes later.
Lobsters54m84
84A
A streaming subscription kept canceling itself minutes after reactivation, and the author traced it to a sync/async race between link and unlink events.
Activation looked fine, then an “expired” email arrived after ~five minutes.
Re-linking only worked when the unlink was left alone overnight.
Author’s model: creation is synchronous, de-linking is asynchronous across systems.
Do DMCA Takedown Notices Need to Expressly Refer to the Lack of Fair Use?–Take-Two v. PlayerAuctionsCourts may infer bad faith from silence.
Eric Goldman2h83
83A
A court lets a DMCA §512(f) case proceed based on an allegation that the takedown sender never considered fair use, suggesting notice writers may need an explicit “we checked” statement.
The ruling equates fair use “authorized by law” with good-faith belief.
Absence of any fair-use discussion became enough to plead bad faith.
The decision implies takedown senders should document fair-use evaluation.
China can survive without the Strait of HormuzChina’s oil shock risk is shrinking fast.
Hacker News (4+ points)48m83
83A
China is positioning itself to absorb a Strait of Hormuz shutdown via EV-driven oil demand slowdown, big stocks, and diversified supply plus grid insulation.
EV adoption likely tops oil demand, reducing import sensitivity.
Large strategic and commercial stocks cover months of Hormuz-linked imports.
Diversified crude sourcing limits dependence on any single supplier.
LeakNet Changes Tactics, But Consistency Gives Defenders an Advantage Entry shifts, but the post-compromise playbook stays.
Security Boulevard1h79
79A
LeakNet’s new ClickFix-style lures and Deno in-memory loader may change entry points, but they still funnel into the same repeatable post-compromise playbook defenders can target.
ClickFix lures users into running commands via compromised trusted sites
This is a high-severity D-Link DNS-120-family UPnP CGI bug where crafted f_dir inputs can trigger remote stack overflow, and exploitation is already public.
UPnP_AV_Server_Path_Del in app_mgr.cgi mishandles f_dir input
Crafted requests can cause remote stack-based buffer overflow
CVSS 8.8 (AV:N) and published exploit raise near-term threat
This preview shows limited topics with basic filters. Subscribers get the complete multi-dimensional scoring engine — every quality dimension, every topic, every source, full score breakdowns.