Understanding and Mitigating AI Hallucinations
This briefing document summarizes the core insights from the provided sources regarding the phenomenon of AI hallucinations, their underlying causes, and proposed solutions.

1. The Nature of AI Hallucinations

AI hallucinations are defined as instances where large language models (LLMs) "confidently make things up," producing "plausible yet incorrect statements instead of admitting uncertainty." This differs fundamentally from human perceptual hallucinations. The problem is not necessarily about making models smarter or training them on more data; rather, it stems from the way AI models are currently trained and evaluated.

Key Facts:
- LLMs often provide "overconfident, plausible falsehoods," which "diminish their utility."
- Examples include generating incorrect birthdates or dissertation titles for known individuals, even when explicitly asked to respond "only if known."
- Hallucinations can be "intrinsic" (contradicting the user's prompt, e.g., miscounting letters in a word) or "extrinsic" (contradicting training data or external reality).

Quote: "Language models are known to produce overconfident, plausible falsehoods, which diminish their utility. This error mode is known as 'hallucination,' though it differs fundamentally from the human perceptual experience." – why-language-models-hallucinate.pdf

2. Root Causes: Training and Evaluation Incentives

The core argument across both sources is that AI models hallucinate because the current training and evaluation paradigms inadvertently reward guessing over honesty.

Main Themes:
- "Terrible Test-Takers": LLMs are "essentially training AI to be terrible test-takers who guess instead of admitting uncertainty."
- Binary Scoring: Most benchmarks operate like "multiple-choice exams" with "binary 0-1 scheme[s]" where "1 point for a correct answer and none for blanks or IDKs." This incentivizes guessing, as "leaving an answer blank guarantees failure but guessing gives you a 1-in-365 chance of nailing someone's birthday." (A minimal sketch of this incentive appears after this list.)
- Vicious Cycle: This leads to models learning to "bluff," generating "confident-sounding nonsense rather than admit uncertainty." As models become more capable, they continue to hallucinate because "that's what scores best on tests."
- Statistical Origins (Pretraining): Hallucinations "originate simply as errors in binary classification." Even with error-free training data, the statistical objectives minimized during pretraining can lead to errors. This is due to factors like:
  - Arbitrary Facts: When there's no learnable pattern in the data (e.g., specific birthdays), models are likely to hallucinate, with the hallucination rate being at least the "fraction of training facts that appear once."
  - Poor Models: The model architecture itself may be insufficient to represent the concept well (e.g., trigram models struggling with longer dependencies) or may not be a good fit even if expressive enough.
  - Computational Hardness: Problems that are computationally intractable for even superhuman AI will lead to errors if the model attempts to solve them rather than defer.
  - Distribution Shift (OOD Prompts): Prompts that differ significantly from training data can induce errors.
  - GIGO (Garbage In, Garbage Out): Training corpora often contain factual errors, which base models can replicate.
- Persistence (Post-Training): Despite efforts to reduce hallucinations during post-training (e.g., RLHF), they persist because "guessing when unsure maximizes expected score under a binary 0-1 scheme." Existing primary evaluations "overwhelmingly penalize uncertainty."
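To make the scoring incentive concrete, here is a minimal Python sketch (an illustration, not drawn from the sources) comparing the expected benchmark score of guessing versus abstaining under a binary 0-1 grader and under a hypothetical grader that penalizes confident errors and gives partial credit for "I don't know."

```python
# Illustrative sketch: why binary 0-1 grading rewards guessing over abstaining.
# The probabilities and scoring constants here are hypothetical, not from the sources.

def expected_score(p_correct: float, abstain: bool,
                   reward_correct: float = 1.0,
                   reward_idk: float = 0.0,
                   penalty_wrong: float = 0.0) -> float:
    """Expected score for one question under a simple grading scheme."""
    if abstain:
        return reward_idk
    return p_correct * reward_correct - (1 - p_correct) * penalty_wrong

# A model unsure about a birthday: roughly a 1-in-365 chance if it guesses.
p = 1 / 365

# Binary 0-1 scheme: blanks/IDKs score 0, so guessing strictly dominates.
print("binary, guess:  ", expected_score(p, abstain=False))   # ~0.0027
print("binary, abstain:", expected_score(p, abstain=True))    # 0.0

# A scheme that penalizes confident errors and gives partial credit for IDK
# flips the incentive toward honesty when the model is this uncertain.
print("penalized, guess:  ", expected_score(p, abstain=False, penalty_wrong=0.5))
print("penalized, abstain:", expected_score(p, abstain=True, reward_idk=0.2))
```

Under the binary scheme the guess has a small positive expected value and abstaining has none, which is exactly the incentive the sources describe.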
MUVERA: The Search Revolution That Changes Everything
How Google Just Made Multi-Vector Search Lightning Fast (And Why Every SEO Should Care)

(My thoughts on how this will cleave semantic search going forward)

MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) represents a paradigm-shifting breakthrough that solves the fundamental scalability challenges of multi-vector embeddings while preserving their superior semantic understanding capabilities. This Google Research innovation transforms complex multi-vector similarity calculations into simple dot product operations, enabling sophisticated semantic search at web scale without prohibitive computational costs[1][2][3].

Key Technical Breakthrough: Transforming Multi-Vector to Single-Vector MIPS

MUVERA's core innovation lies in Fixed Dimensional Encodings (FDEs) - a mathematically elegant approach that converts variable-length multi-vector embeddings into single, fixed-size vectors whose inner product approximates the original multi-vector similarity[1][2][3]. This transformation enables the use of highly optimized Maximum Inner Product Search (MIPS) algorithms, leveraging decades of algorithmic optimization for efficient retrieval[4][5].

The algorithm operates through a sophisticated four-step process: LSH-based partitioning using SimHash, representative sub-vector creation through aggregation, multiple repetitions for robustness, and concatenation into fixed-dimensional encodings[1][2]. This data-oblivious approach provides theoretical guarantees for approximation quality while maintaining consistency across diverse datasets and applications. (A rough sketch of this pipeline follows below.)

Performance Achievements and Real-World Implementation

MUVERA delivers remarkable performance improvements across multiple dimensions. On the BEIR benchmark suite, it achieves an average of 10% higher recall compared to previous state-of-the-art systems while simultaneously reducing query latency by 90%[1][6][3]. Memory footprint reductions of approximately 70% make multi-vector approaches viable for organizations previously constrained by infrastructure costs[7][8].
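To make the four-step FDE construction more concrete, here is a deliberately simplified Python/NumPy sketch of the idea: SimHash bucketing, per-bucket aggregation, multiple repetitions, and concatenation into one fixed-size vector whose dot product stands in for the multi-vector similarity. The aggregation rules, empty-bucket handling, and dimensions are my assumptions for illustration and do not reproduce the exact MUVERA algorithm.

```python
# Simplified, illustrative FDE pipeline (NOT the exact MUVERA algorithm):
# SimHash partitioning, per-bucket aggregation, repetitions, concatenation.

import numpy as np

def simhash_buckets(vectors: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Assign each vector to a bucket via the sign pattern of random projections."""
    bits = (vectors @ planes.T) > 0                              # (n, k_bits) booleans
    return bits.astype(int) @ (1 << np.arange(planes.shape[0]))  # pack bits -> bucket id

def fde(vectors: np.ndarray, plane_sets: list[np.ndarray], is_query: bool) -> np.ndarray:
    """Build a fixed-dimensional encoding from a variable-length set of vectors."""
    dim = vectors.shape[1]
    blocks = []
    for planes in plane_sets:                     # multiple repetitions for robustness
        n_buckets = 2 ** planes.shape[0]
        buckets = simhash_buckets(vectors, planes)
        block = np.zeros((n_buckets, dim))
        for b in range(n_buckets):
            members = vectors[buckets == b]
            if len(members) == 0:
                continue                          # empty buckets left as zeros (simplification)
            # Assumed rule: queries sum per bucket, documents average per bucket.
            block[b] = members.sum(0) if is_query else members.mean(0)
        blocks.append(block.ravel())
    return np.concatenate(blocks)                 # fixed size regardless of input length

rng = np.random.default_rng(0)
plane_sets = [rng.standard_normal((3, 128)) for _ in range(4)]  # 4 repetitions, 8 buckets each
query_vecs = rng.standard_normal((16, 128))                     # per-token query embeddings
doc_vecs = rng.standard_normal((90, 128))                       # per-token document embeddings

score = fde(query_vecs, plane_sets, is_query=True) @ fde(doc_vecs, plane_sets, is_query=False)
print(score)  # a single dot product approximating the multi-vector similarity
```

The payoff is in the last line: once every document is reduced to one fixed-length vector, retrieval becomes standard MIPS over single vectors, which is what existing ANN infrastructure is optimized for.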
On Triggering AI Overviews and Measuring AI Overview Outputs
Technical research on attention mechanisms reveals that KV caches (key-value caches) can be reused across multi-turn conversations to reduce computational overhead. AttentionStore research demonstrates that reusing attention computations can decrease time to first token by up to 88% and significantly improve prompt prefilling throughput. However, this optimization occurs at the infrastructure level rather than creating persistent context across API calls. Each call still requires explicit context management from the application developer's perspective (see the sketch below).

The long and short of this: beyond the non-deterministic nature of AI output, repeated queries "poison" the models through these two mechanisms. Attention management, both explicit and likely implicit (via inferred RL mechanisms), creates massive problems for tool reliability. And this, particularly KV caching, is difficult to quantify except in probabilistic terms.

Long context ≠ context transfer: models like Gemini 1.5 (1M tokens) excel at intra-task comprehension but offer no cross-call continuity without orchestration.

API call consistency: parallel requests under one key magnify non-determinism, as confirmed by OpenAI community reports.

Learn the practical implications of this at the Darkest AI Mastermind. nov.link/DarkestAI July 31-Aug 3 (Wisconsin and virtually)
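As a minimal sketch of what "explicit context management" means in practice, the snippet below shows an application carrying its own conversation history and resending it on every call, because the API itself is stateless. The `call_llm` function is a hypothetical stand-in for whatever chat-completion client is actually used; any KV-cache reuse happens server-side and is invisible to this code.

```python
# Minimal sketch: LLM chat APIs are stateless per request, so the application
# must carry the conversation history itself and resend it on every call.
# `call_llm` is a hypothetical placeholder, not a real library function.

def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion API call (stateless per request)."""
    raise NotImplementedError("wire this to your provider's client")

history: list[dict] = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)  # the FULL history goes over the wire every time
    history.append({"role": "assistant", "content": reply})
    return reply
```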
Major new paper on HUMAN LIKE THINKING
FIND the analysis in the Bleeding Edge Classroom
Evaluating the DeepSeek tech stack with a critical eye
We've obtained and evaluated a pre-print DeepSeek Technical Report....

DeepSeek-V3: Core Contributions and Characteristics

This report details DeepSeek-V3, a Mixture-of-Experts (MoE) language model with a total of 671 billion parameters, of which 37 billion are activated for each token. The model's design prioritizes efficient inference and cost-effective training, incorporating specific architectural components and training strategies.

Architectural Innovations:
- Multi-head Latent Attention (MLA): DeepSeek-V3 utilizes MLA, which aims to reduce the Key-Value (KV) cache during inference through low-rank compression of attention keys and values. The technique compresses queries, keys, and values into latent vectors, and the compressed KV latents can be cached during inference. This caching significantly reduces the memory footprint while maintaining performance.
- DeepSeekMoE with Auxiliary-Loss-Free Load Balancing: The model employs the DeepSeekMoE architecture, using finer-grained experts and isolating some as shared. It introduces an auxiliary-loss-free load balancing strategy to minimize the performance degradation caused by imbalanced expert load, which occurs with MoE training. This strategy avoids conventional auxiliary losses and instead adds a dynamic bias term to the affinity scores to distribute the load; a sequence-wise auxiliary loss additionally prevents imbalance within a single sequence. (A sketch of the bias-adjusted routing idea follows below.)
- Multi-Token Prediction (MTP): The model incorporates a multi-token prediction objective, extending the prediction scope to multiple future tokens at each position. The implementation uses sequential modules to predict additional tokens and keeps the causal chain at each prediction depth. During inference, the MTP modules can be discarded so the model functions normally, or used to improve latency via speculative decoding.

Infrastructure and Training Framework:
- Compute Infrastructure: DeepSeek-V3 was trained on a cluster equipped with 2048 NVIDIA H800 GPUs, connected by NVLink within nodes and by InfiniBand (IB) across nodes.
- DualPipe Algorithm: A pipeline parallelism method named DualPipe overlaps computation and communication across the forward and backward passes. It divides the computation into components and rearranges them with manual adjustment to ensure that communication is hidden during execution.
- Cross-Node All-to-All Communication: The authors implement custom kernels for cross-node all-to-all communication, leveraging IB and NVLink. A node-limited routing mechanism caps the number of receiving nodes for each token, and only 20 SMs are used to implement the all-to-all communication.
- Memory-Saving Techniques: Several methods reduce memory usage, including recomputing RMSNorm and MLA up-projections during back-propagation, storing the exponential moving average (EMA) of model parameters in CPU memory, and sharing the embedding and output head between modules.
- FP8 Training: The model leverages a fine-grained mixed-precision framework using the FP8 data format to accelerate training and reduce GPU memory usage. Techniques are introduced to preserve precision, including a tile-wise or block-wise quantization strategy to handle feature outliers and promotion of GEMM accumulation to CUDA cores. Key components of the architecture are retained in FP32 and BF16.
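To illustrate the auxiliary-loss-free load balancing idea described above, here is a small NumPy sketch: each expert carries a bias that is added to its affinity score only for top-k selection, and the bias is nudged after each batch depending on whether the expert was over- or under-loaded. The update rule, constants, and toy data are my simplified assumptions, not the report's exact procedure.

```python
# Illustrative sketch of auxiliary-loss-free load balancing: a per-expert bias
# steers top-k routing toward underloaded experts without adding a loss term.
# Update rule and constants are assumptions, not DeepSeek-V3's exact method.

import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, gamma = 8, 2, 0.001     # gamma: bias update speed (assumed)
bias = np.zeros(n_experts)                # per-expert routing bias

def route(affinity: np.ndarray) -> np.ndarray:
    """Select top-k experts per token using biased scores; bias affects selection only."""
    biased = affinity + bias              # bias steers which experts are chosen...
    return np.argsort(-biased, axis=1)[:, :top_k]   # ...gating weights would still use raw affinity

for step in range(100):                   # toy training loop
    affinity = rng.random((1024, n_experts))        # token-to-expert affinity scores
    chosen = route(affinity)
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    # Nudge bias: overloaded experts get pushed down, underloaded experts pulled up.
    bias -= gamma * np.sign(load - load.mean())

print("per-expert load after balancing:", load)
```

The design point is that the correction happens in the routing decision itself rather than through an auxiliary loss term, so the main training objective is left untouched.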