Week of May 11, 2026 — Sia Reads What Matters

RELEASE 2026-05-05

The first subquadratic LLM is here — 12M token context, 52x faster attention, and roughly 1/5th the cost of frontier models

Subquadratic came out of stealth with $29M in seed funding and launched SubQ, the first commercial LLM built on a fully sub-quadratic sparse attention architecture. Its SSA engine scales linearly with context length instead of quadratically, cutting attention compute by roughly 1,000x at 12M tokens. The production API ships with a 1M-token window today; 12M is available in the research preview. Early benchmarks show coding performance competitive with Opus at around 1/20th the cost.

Subquadratic Launches SubQ with 12-Million-Token Context Window at 1/5th Frontier Cost

The New Stack | Hacker News Discussion | SiliconANGLE Coverage
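As a rough back-of-envelope illustration of the scaling claim in this item: dense attention scores every token against every other token, so its work grows with n^2, while a sparse scheme that limits each token to a fixed budget of k positions grows with n*k. The budget below is a hypothetical parameter, not a figure published by Subquadratic; it is chosen only so that the ratio lands near the roughly 1,000x quoted for 12M tokens. The sketch counts scored token pairs and says nothing about how SSA actually selects positions.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by dense attention: every token against every token."""
    return n_tokens * n_tokens


def sparse_attention_pairs(n_tokens: int, budget: int) -> int:
    """Token pairs scored when each token attends to at most `budget` positions."""
    return n_tokens * min(budget, n_tokens)


def speedup(n_tokens: int, budget: int) -> float:
    """Ratio of dense to sparse work; equals n_tokens / budget once n_tokens > budget."""
    return dense_attention_pairs(n_tokens) / sparse_attention_pairs(n_tokens, budget)


# Hypothetical per-token budget, NOT a number from Subquadratic.
ASSUMED_BUDGET = 12_000

for n in (128_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens: ~{speedup(n, ASSUMED_BUDGET):,.0f}x fewer pairs")
```

Under this assumption the advantage grows in proportion to context length, which is why the gap would be modest at 128K tokens and large at 12M.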

RELEASE 2026-05-07

OpenAI's new realtime voice API reasons while it talks — 128K context, parallel tool calls, and live translation across 70+ languages

OpenAI released three new realtime voice models: GPT-Realtime-2 brings GPT-5-class reasoning into the audio loop with a 128K context window (up from 32K), parallel tool calling, and conversational preambles so agents can say 'let me check that' while working. GPT-Realtime-Translate handles live speech translation across 70+ input languages and 13 output languages. GPT-Realtime-Whisper adds streaming transcription. Together, they move voice interfaces from call-and-response toward agents that can reason, act, and translate mid-conversation.

OpenAI Ships GPT-Realtime-2 — Voice Models with GPT-5-Class Reasoning, Translation, and Streaming Transcription

OpenAI Blog | TechCrunch Coverage

RELEASE 2026-05-12

Anthropic went all-in on legal: 20+ MCP connectors, 12 practice-area plugins, and legal is now the #1 Cowork power-user function

Anthropic launched Claude for Legal with 20+ MCP connectors covering iManage, NetDocuments, Relativity, Everlaw, Docusign, Thomson Reuters CoCounsel, and more. Twelve new practice-area plugins span corporate M&A, litigation, privacy, IP, employment, and regulatory law — each starting with a setup interview that learns a team's playbooks and risk calibration. Legal professionals became the most engaged Cowork users of any knowledge-work function, with 3x the usage of any other vertical. Four plugins are also available as Managed Agents via the API.

Anthropic Ships Claude for Legal — 20+ Connectors, 12 Practice-Area Plugins, 80 Pre-built Agents

LawSites Coverage | TechCrunch Coverage

RELEASE 2026-05-06

ZAYA1-8B has under 1B active parameters and beats Claude 4.5 Sonnet on HMMT math — trained entirely on AMD hardware, Apache 2.0

Zyphra released ZAYA1-8B, an open-source MoE reasoning model with 8.4B total parameters but only 760M active per token, trained end-to-end on 1,024 AMD MI300X GPUs. It matches or exceeds models many times its size on math and coding benchmarks, and with Zyphra's novel Markovian RSA test-time compute method, it surpasses Claude 4.5 Sonnet and GPT-5-High on HMMT'25 math. The model introduces Compressed Convolutional Attention and a learned MLP router for expert selection. It is released under the Apache 2.0 license, with weights on HuggingFace.

Zyphra Releases ZAYA1-8B — Sub-1B Active Params Competing with Frontier Reasoning Models

Zyphra Blog | Technical Report | HuggingFace Weights
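For readers unfamiliar with expert routing, here is a generic sketch of how an MLP router can pick experts per token, which is the general mechanism that lets a model hold 8.4B parameters while activating only a fraction of them. This is not Zyphra's router: the layer sizes, the random weights, and the `top_k` value are all hypothetical, and a real router learns its weights during training.

```python
import math
import random


def mlp_router_scores(x, w1, w2):
    """Two-layer MLP: hidden = relu(x @ w1), logits = hidden @ w2."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in w1]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w2]


def route(x, w1, w2, top_k=1):
    """Pick the top_k experts by logit and softmax-normalise their gate weights."""
    logits = mlp_router_scores(x, w1, w2)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Random stand-in weights and hypothetical sizes, for illustration only.
rng = random.Random(0)
DIM, HIDDEN, EXPERTS = 8, 16, 4
w1 = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(HIDDEN)]      # one column per hidden unit
w2 = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(EXPERTS)]  # one column per expert
token = [rng.gauss(0, 1) for _ in range(DIM)]

for expert, weight in route(token, w1, w2, top_k=2):
    print(f"expert {expert}: gate weight {weight:.2f}")
```

Only the experts returned by `route` would run for that token; the rest of the weights stay idle, which is where the gap between total and active parameter counts comes from.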

NEWS 2026-05-14

Anthropic and the Gates Foundation are putting $200M toward AI for neglected diseases, literacy in sub-Saharan Africa, and open agricultural datasets

Anthropic partnered with the Gates Foundation to commit $200 million in grant funding, Claude credits, and technical support over four years. The largest slice targets global health — accelerating vaccine candidates for polio, HPV, and preeclampsia, and building health-intelligence tools for disease forecasting. Education programs will power K-12 tutoring in the US and foundational literacy apps in sub-Saharan Africa and India. The partnership also funds open agricultural datasets and AI benchmarks as public goods.

Anthropic Blog | Gates Foundation Announcement

RELEASE 2026-05-08

Gemini 3.1 Flash-Lite hits GA at $0.25/M input tokens — 2.5x faster first-token than 2.5 Flash, sub-second classification

Google moved Gemini 3.1 Flash-Lite, its most cost-efficient Gemini 3 model, into general availability at $0.25 per million input tokens and $1.50 per million output tokens. It delivers 2.5x faster time-to-first-token than Gemini 2.5 Flash, sub-second classification, and p95 latency around 1.8 seconds under heavy concurrent load. It is designed for high-volume agentic triage, routing, and tool-calling workloads. JetBrains, Ramp, and Gladly are already running production workloads on it.

Google Ships Gemini 3.1 Flash-Lite in GA — Fastest, Cheapest Gemini 3 Model

Google Cloud Blog | VentureBeat Coverage
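To make the quoted Flash-Lite pricing concrete, the short sketch below turns the two per-million-token rates into a per-request and per-workload cost. The rates come from the item above; the workload (10 million classification calls at 400 input and 5 output tokens each) is a made-up example, and the calculation ignores any caching, batching, or volume discounts.

```python
INPUT_USD_PER_M = 0.25   # USD per million input tokens, from the announcement
OUTPUT_USD_PER_M = 1.50  # USD per million output tokens, from the announcement


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """List-price cost of one request, in USD."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical workload: short classification prompts with very short answers.
calls = 10_000_000
total = calls * request_cost_usd(input_tokens=400, output_tokens=5)
print(f"Estimated cost for {calls:,} calls: ${total:,.2f}")
```

For this hypothetical workload the estimate comes to about $1,075, with input tokens accounting for most of the bill because the outputs are so short.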