Technical research on attention mechanisms shows that key-value (KV) caches can be reused across multi-turn conversations to reduce computational overhead. The AttentionStore work demonstrates that reusing attention computations can cut time to first token by up to 88% and substantially improve prompt-prefilling throughput.
However, this optimization occurs at the infrastructure level rather than creating persistent context across API calls. Each call still requires explicit context management from the application developer’s perspective.
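To make "explicit context management" concrete, here is a minimal sketch using the OpenAI Python SDK's chat completions interface; the model name and message contents are placeholders, not a recommendation. Whatever KV-cache reuse the provider does happens behind the API boundary: the application still has to resend the full conversation history on every call.

```python
# Minimal sketch: conversational context must be carried by the caller on every request.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key in the
# OPENAI_API_KEY environment variable; the model name is illustrative only.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a terse assistant."}]

def ask(user_message: str) -> str:
    """Append the user turn, call the API with the *entire* history,
    then append the assistant turn so the next call can see it."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",    # placeholder model name
        messages=history,       # full history is resent on every call
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Summarize KV caching in one sentence."))
print(ask("Now relate that to time to first token."))  # only coherent because turn 1 was resent
```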
The long and short of it: beyond the non-deterministic nature of AI output, repeated queries can "poison" the models through these two mechanisms, attention management and KV caching. Attention management, both explicit and likely implicit (via inferred RL mechanisms), creates massive problems for tool reliability, and the effect, particularly with KV caching, is difficult to quantify except in probabilistic terms.
Long context ≠ context transfer: models like Gemini 1.5 (1M tokens) excel at intra-task comprehension but offer no cross-call continuity without orchestration.
API call consistency: parallel requests under one key magnify non-determinism, as confirmed by OpenAI community reports.
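To see the consistency point in practice, here is a hedged sketch that fires the same prompt several times in parallel under one API key and counts how many distinct responses come back; it again assumes the OpenAI Python SDK, and the model name, prompt, and request count are illustrative. Even at temperature 0, outputs are not guaranteed to be byte-identical, so divergence can only be characterized statistically.

```python
# Sketch: measure output divergence across parallel identical requests under one key.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY; model name is illustrative only.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
PROMPT = "List three causes of LLM non-determinism."

def one_call(_: int) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",    # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,          # lowers variance, but is not a determinism guarantee
    )
    return response.choices[0].message.content

# Fire 8 identical requests concurrently under the same API key.
with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(one_call, range(8)))

distinct = len(set(outputs))
print(f"{distinct} distinct responses out of {len(outputs)} identical requests")
```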
Learn the practical implications of this at the Darkest AI Mastermind.
July 31-Aug 3 (Wisconsin and virtually)