Science Always Advances at the Cost of Funerals

Overview

This essay rethinks the side effects of generative AI by drawing a parallel with the compression artifacts familiar from digital life. It argues that the flaws in AI-generated content come not from compressing training data but from the 'expansion' step, in which a model fills in missing information to produce its output. The core idea is that these 'expansion artifacts' are more than minor glitches: they act as forensic markers, shape new aesthetics, and pose systemic risks when they accumulate in AI feedback loops, even threatening to homogenize future training data. It is a thought-provoking angle that invites you to re-examine the AI-generated content you encounter every day.

这篇文章用我们熟悉的数字压缩伪影做类比,重新审视了生成式AI的副作用。作者提出,AI生成内容的缺陷并非来自训练数据的压缩过程,而是来自模型补全缺失信息、生成输出的「扩展」环节。核心观点是这些「扩展伪影」不只是小故障:它们既能当作溯源的取证标记,也会催生新的审美风格,更会在AI互相喂料的反馈循环中不断累积,甚至会让未来AI的训练数据越来越同质化,带来系统性风险。角度非常新颖,看完会忍不住重新审视你每天碰到的所有AI生成内容。

The information age has been defined by bandwidth. The internet is limited by how much data we can squeeze into the narrow pipes of transmission infrastructure. So we invented *compression*, ways of representing the same object — a website, a picture, a song, a movie — within ever smaller digital footprints. YouTube, Spotify, Instagram, and the algorithms that make them work, wouldn't be possible without it.

信息时代的核心限制其实是带宽。互联网的天花板,本质上是我们能往传输基础设施的窄管道里塞多少数据。所以我们发明了「压缩」技术,用更小的数字存储空间来承载同样的内容——不管是网站、图片、歌曲还是电影。如果没有压缩技术,YouTube、Spotify、Instagram这些平台,还有支撑它们运行的算法,根本不可能存在。

From the very first studies into compression (at Bell Labs in the 1940s), researchers knew they'd have to accept a tradeoff: you can achieve smaller file sizes if you're willing to accept some loss of the original data. This seems counterproductive, since the whole idea is to reproduce the data, but scientists found ways to only discard information that is imperceptible to humans.

早在1940年代贝尔实验室刚开始研究压缩技术的时候,研究者就清楚这是个权衡游戏:只要你愿意接受原始数据的部分损失,就能得到更小的文件体积。这听起来好像违背了压缩的初衷——毕竟我们做压缩本来是为了完整还原数据,但科学家找到了巧妙的办法:只丢掉人类感知不到的信息。

Our ears and brains tend to ‘filter out’ quiet sounds overshadowed by loud sounds. MP3s take advantage of this blind spot by stripping out the quiet bits we wouldn't be likely to hear anyway.

我们的耳朵和大脑会自动「过滤」掉被响亮声音盖住的微弱声响。MP3格式就是利用了这个感知盲区,把我们本来就不太可能听到的细微声音片段直接删掉,来缩小体积。

Our eyes and brains focus on the contrast between bright and dark shapes, reading the broad structure of images rather than granular details or tiny color variations. The JPG algorithm compresses files by throwing away information we don't tend to process.

我们的视觉系统更关注明暗形状之间的对比度,优先识别图像的整体结构,而不是细碎的细节或者微小的色彩差异。JPG算法就会把我们不太会注意到的信息丢掉,实现压缩。

Movies don't actually change that much frame-to-frame. The MPG algorithm carefully chooses key frames and, for the frames in between, saves only the *relative motion* of blocks of pixels, making movie files much smaller in the process.

电影的相邻帧之间其实变化并不大。MPG算法会挑选出关键帧,对中间的帧只保存像素块相对于参考帧的运动变化,就能大幅缩减电影文件的体积。
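The key-frame idea can be illustrated with a toy sketch. It assumes each frame is a flat list of pixel values, stores a full key frame every few frames, and stores only the changed pixels for the frames in between. The names `encode`, `decode`, and `key_interval` are hypothetical, and this is lossless delta coding rather than the MPG algorithm itself: real codecs predict motion for blocks of pixels and then lossily compress what remains.

```python
def encode(frames, key_interval=4):
    """Store a full key frame every `key_interval` frames, deltas otherwise."""
    encoded = []
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            # Key frame: keep every pixel.
            encoded.append(("key", list(frame)))
        else:
            # Delta frame: keep only the pixels that differ from the previous frame.
            prev = frames[i - 1]
            delta = {j: v for j, (p, v) in enumerate(zip(prev, frame)) if p != v}
            encoded.append(("delta", delta))
    return encoded


def decode(encoded):
    """Rebuild every frame by applying deltas on top of the previous frame."""
    frames = []
    for kind, payload in encoded:
        if kind == "key":
            frame = list(payload)
        else:
            frame = list(frames[-1])
            for j, v in payload.items():
                frame[j] = v
        frames.append(frame)
    return frames


frames = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0]]
encoded = encode(frames)
```

In this example the second and third frames each shrink to a single changed pixel, which is where the savings come from when most of the picture stays still.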

A well-designed compression algorithm keeps data *perceptually* identical while making files much more efficient to store and transmit.

设计优秀的压缩算法,能在大幅提升存储和传输效率的同时,让数据在人类感知层面和原始内容几乎没有区别。

A poorly designed codec can go catastrophically wrong. In 2013, David Kriesel scanned a building floor plan on a Xerox WorkCentre and noticed that a room marked 21.11m² had become 14.13m². Xerox's implementation of the JBIG2 compression format saves space by quilting scans together from common, repeated elements; in Kriesel's scan, it had silently replaced the original numbers with ones from another part of the document it deemed visually similar enough. After Kriesel published, reports surfaced of the same silent substitution affecting building plans, invoices, and medical records.

但设计糟糕的编码解码器可能会带来灾难性的问题。2013年,David Kriesel用施乐WorkCentre扫描仪扫建筑平面图的时候发现,原本标注21.11平方米的房间,扫出来变成了14.13平方米。施乐用的JBIG2压缩格式会把扫描内容里的重复元素拼接在一起来节省空间;在这次扫描里,系统觉得原文的数字和文档另一部分的数字视觉上足够相似,就悄无声息地做了替换。Kriesel公布这个问题之后,越来越多类似的案例浮出水面:建筑图纸、发票、医疗记录都出现过这种被静默替换的情况。

Lossy compression always changes data permanently. Common formats (JPG, MP3, MP4) make changes slowly and gently: it usually takes hundreds of cycles of saving, sharing, and re-uploading before the tool marks, called *compression artifacts*, become apparent.

有损压缩本质上是对数据的永久性改变。我们常用的JPG、MP3、MP4这些格式的改变是缓慢且温和的:通常要经过数百次保存、分享、重新上传的循环,那些被称为「压缩伪影」的工具痕迹才会变得明显。

Re-save a JPG too many times and it goes blocky and washed out;

图片会变得满是色块、褪色发灰;

re-encode an MP3 too many times and metallic tones bleed through the music;

音乐里会多出刺耳的金属杂音;

re-upload an MP4 too many times and you end up with a blobby mess over unintelligible audio.

到最后就是糊成一团的画面配上根本听不清的音频。

Before compression: PSNR of infinity.

压缩前:峰值信噪比是无穷大(和原文件完全一致)。

After 10,000 compression cycles: PSNR of 14.59.

经过1万次压缩循环后:峰值信噪比只剩14.59(和原文件差距已经非常大)。
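For readers unfamiliar with the metric in these captions: PSNR (peak signal-to-noise ratio) is conventionally computed as 10 · log10(MAX² / MSE) and reported in decibels, where MSE is the mean squared error between the original and the degraded signal. A minimal sketch follows, assuming 8-bit samples held in flat lists; the function name `psnr` and its parameters are illustrative, not taken from the essay.

```python
import math


def psnr(original, degraded, max_value=255):
    """Peak signal-to-noise ratio in decibels between two equal-length signals."""
    if len(original) != len(degraded) or not original:
        raise ValueError("signals must be non-empty and the same length")
    # Mean squared error between corresponding samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
    if mse == 0:
        # Identical signals: no noise at all, so the ratio is unbounded.
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


identical = psnr([0, 128, 255], [0, 128, 255])
slightly_off = psnr([10, 20], [11, 19])
```

This is why an untouched file scores infinity: with zero error there is nothing in the denominator. Every bit of damage pulls the figure down, and lower values mean the copy has drifted further from the original.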

If you know what to look for and how to look for it, you can learn a lot about the path that data took to get to you. That's because compression artifacts are themselves **meta-information**; you can learn something new about a document by identifying and cataloging its algorithm-induced flaws. Digital forensics uses this meta-information to explore the provenance of documents, photos and videos. Compression leaves breadcrumbs that betray whether or not a document has been edited (and often who, or what, edited it).

如果你知道怎么识别压缩伪影,就能摸清一份数据传到你手里之前走过的路径。因为压缩伪影本质上是「元信息」:通过识别和整理这些算法带来的缺陷,你能挖到文档背后的很多信息。数字取证领域就会用这些元信息来追溯文档、照片、视频的来源。压缩留下的痕迹会暴露一份文件有没有被编辑过,甚至经常能查出来是谁、或者什么工具编辑的。

Compression artifacts can even become an aesthetic of their own. Deep-fried memes dress images up in the aesthetics of pictures that have been shared and re-shared thousands of times. Datamoshing manipulates compression algorithms to create entirely new video aesthetics. Glitch music stretches and squashes audio files, making the tool marks of audio compression audible and even musical.

压缩伪影甚至能发展出独立的审美风格。「油炸梗图」就是模拟图片被反复转发数千次后的压缩痕迹效果;数据摩绪(datamoshing)技术通过操纵压缩算法创造出了全新的视频美学;故障音乐则会拉伸、挤压音频文件,把音频压缩的工具痕迹变成可听的、甚至有音乐性的元素。

Compression has spawned entire fields of art and science all in service of the ideal compromise between fidelity and file size.

压缩技术催生了一整个艺术和科学领域,所有研究都是为了在保真度和文件体积之间找到最理想的平衡点。

Three years ago, Ted Chiang proposed that LLMs are a lossy compression of their training data, which is itself a lossy sample of all the data available to it. But the artifacts we see in AI slop aren't in the compression. They're in the decompression.

三年前,特德·姜提出大语言模型本质上是对训练数据的有损压缩,而这些训练数据本身又是所有可用数据的有损采样。但我们在那些粗制滥造的AI内容里看到的缺陷,其实不是来自压缩过程,而是来自「解压缩」环节。

Every AI-generated output is an extrapolation from that blurry source, vectored toward your prompt, filling in plausible detail where the compression threw information away. The output gets inflated into blog posts and LinkedIn thoughtspam, software platforms, omnichannel advertising campaigns, and movie cameos from dead actors. Chiang compared the gaps and confabulations to compression artifacts.

所有AI生成的内容,都是从模糊的训练数据源出发,顺着你的提示词方向做外推,把压缩过程中丢掉的信息用看似合理的细节补全。这些输出被膨胀成博客文章、LinkedIn上的空话感想、软件平台、全渠道广告活动,甚至是已故演员在电影里的客串镜头。特德·姜把这些内容里的漏洞和虚构造假比作压缩伪影。

I think they're expansion artifacts.

但我觉得,这些应该叫「扩展伪影」才对。

What do expansion artifacts look like?

扩展伪影长什么样?

- LLMs produce text stuffed with hedging verbs and fuzzing adjectives (*delve, intricate, tapestry, multifaceted*). Their paragraphs are structured as miniature essays with setup, payoff, and a signposted takeaway (*This matters because…*).

- 大语言模型写的文本里塞满了模棱两可的动词和空泛的形容词(比如「深入探究」「错综复杂」「织锦」「多维度」这类词),段落结构就像微型作文:有铺垫、有结论,还有明确标出来的要点(比如「这一点之所以重要,是因为……」)。

- AI-generated code over-comments the obvious and creates error handlers for operations that can't logically fail.

- AI生成的代码会给显而易见的逻辑加过多注释,还给逻辑上根本不可能出错的操作写错误处理程序。

- Image generators have had their own tells: six-fingered hands, symmetrical-but-stylistically-objectionable jewelry, text that looks like text but only if you cross your eyes.

- AI图像生成器有自己的破绽:六根手指的手、对称但风格违和的首饰,还有眯着眼看才像文字的乱码字符。

- Video models struggle with continuity. Limbs appear and disappear, objects clip through each other, and physics sometimes just switches off.

- 视频生成模型在连续性上表现很差:四肢会突然出现或消失,物体会互相穿模,有时候物理规则直接失效。

Each of these artifacts is the training distribution leaking through where the model's confidence runs thin. Like compression artifacts, they double as forensic markers. In 2024, Stanford researchers tracked AI contributions to academic writing by watching for words whose frequency spiked after ChatGPT's release (*commendable, meticulous, pivotal, showcasing,* etc.). They estimated that 17.5% of recent computer science papers and 16.9% of peer review text contain AI-drafted content. Sometimes the tells are less subtle: one paper in an Elsevier journal opened with "Certainly, here is a possible introduction for your topic."

这些伪影本质上都是模型信心不足的时候,训练数据分布漏出来的痕迹。和压缩伪影一样,它们也可以当作取证标记。2024年,斯坦福的研究者通过追踪ChatGPT发布后使用频率飙升的词(比如「值得称赞的」「细致的」「关键的」「展示」等),来统计学术写作里AI的参与度。他们估计近期17.5%的计算机科学论文、16.9%的同行评审文本里都有AI起草的内容。有时候破绽甚至更明显:爱思唯尔旗下某期刊的一篇论文,开头直接就是「当然,以下是为你的主题撰写的可能的引言」。
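The word-frequency idea can be sketched crudely: count how often a few marker words appear per thousand words in text written before and after a cutoff, then compare the rates. This is not the statistical method the Stanford researchers used, only a toy illustration. The helper names, the `floor` parameter, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Marker words quoted in the essay.
MARKERS = ("commendable", "meticulous", "pivotal", "showcasing")


def rate_per_thousand(texts, markers=MARKERS):
    """Occurrences of each marker word per 1,000 words across `texts`."""
    words = [w for t in texts for w in re.findall(r"[a-z]+", t.lower())]
    if not words:
        return {m: 0.0 for m in markers}
    counts = Counter(words)
    return {m: 1000 * counts[m] / len(words) for m in markers}


def spike_ratio(before_texts, after_texts, markers=MARKERS, floor=0.01):
    """Ratio of after-rate to before-rate; `floor` avoids division by zero."""
    before = rate_per_thousand(before_texts, markers)
    after = rate_per_thousand(after_texts, markers)
    return {m: after[m] / max(before[m], floor) for m in markers}


ratios = spike_ratio(["one pivotal word here"], ["pivotal pivotal word here"])
```

A ratio well above 1 for a marker word suggests its usage jumped between the two samples. On its own that is a hint rather than proof of AI drafting, since vocabulary also shifts for ordinary reasons.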

Expansion artifacts will become aesthetic choices, too. DALL-E's uncanny food photography is my favorite, the kind of insane imagery that only a generative model would create. Power users of AI website generators (AI-pilled designers) already know how to recognize the tool marks, if only to try to prompt them away: purple gradients are an especially common tell. But as more and more non-designers use tools like Claude Design to prompt their way to fully-functional software products, I expect to see a *preference* for the aesthetic convergence endemic to the current crop of AI models.

扩展伪影也会变成一种审美选择。我最喜欢的是DALL-E生成的诡异美食摄影,那种只有生成式模型才能做出来的疯癫画面。AI建站工具的资深用户(被AI洗脑的设计师)已经知道怎么识别这些工具痕迹,甚至会想办法通过提示词去掉它们:紫色渐变就是特别常见的AI特征。但随着越来越多非设计师用户用Claude Design这类工具,靠提示词生成功能完整的软件产品,我预计大家反而会开始偏好当前这批AI模型特有的同质化审美风格。

"I'd like to formally apologize for making every button in Tailwind UI `bg-indigo-500` five years ago, leading to every AI generated UI on earth also being indigo." (Adam Wathan, creator of Tailwind, joking on Twitter)

「我要正式道歉,五年前我把Tailwind UI里的所有按钮都设成了`bg-indigo-500`(靛蓝色),导致现在全世界所有AI生成的UI都是靛蓝色的。」——Tailwind作者Adam Wathan在推特上的玩笑

Expansion artifacts get genuinely dangerous when they compound, when one AI generation becomes the input to another, and another, and another. In February, an autonomous OpenClaw agent harassed a maintainer of the popular matplotlib Python library for rejecting its code. Benj Edwards then reported the story for Ars Technica, but used AI to help him write; unsurprisingly, the article contained multiple factual errors about the incident.

当扩展伪影不断累积,一个AI的生成结果变成另一个AI的输入,如此反复循环的时候,就会产生真正的危险。今年2月,一个自动编程代理因为自己提交的代码被拒绝,不断骚扰热门Python库matplotlib的维护者。后来Benj Edwards为Ars Technica报道这件事的时候用了AI辅助写作,不出所料,文章里出现了多处和事件相关的事实错误。

This kind of Gell-Mann Amnesia for expansion artifacts (spotting a source's errors in a field you know well, then trusting it on everything else) leads to runaway feedback loops:

这种对扩展伪影的「盖尔曼健忘症」(指人们在自己熟悉的领域能发现媒体错误,却在不熟悉的领域相信媒体可信)会导致失控的反馈循环:

1. A CEO dictates a five-minute voice memo

1. CEO录了一段5分钟的语音备忘录

2. Claude expands it into a strategy doc

2. Claude把它扩展成一份战略文档

3. Notion's AI turns the strategy doc into product specs

3. Notion AI把战略文档转换成产品需求文档

4. Cursor vibe-codes a prototype

4. 代码编辑器Cursor靠「氛围编码」写出原型

5. Devin gives feedback on the PR

5. AI程序员Devin给代码合并请求提反馈

6. ChatGPT writes the launch copy

6. ChatGPT写发布文案

7. Intercom's Fin support agent fields support questions.

7. Intercom的Fin智能客服负责解答用户问题。

Every stage interpolates the previous context with data pulled from the blurry JPEG of its training distribution.

每一个环节都会用自己模糊得像JPG一样的训练数据,对上一步的上下文做插值补全。

The real danger happens when expansion artifacts show up in the training data for the next generation of generative AI. While anomalous tokens like 'SolidGoldMagikarp' broke early models open and spread their guts out on the operating table, new models get rigorously evaluated, making it harder to see where the errors are hiding. The long tail in the distribution (quiet voices, the weird and novel phrasings, unusual and challenging ideas) fades as each successive model converges toward a homogenized, perhaps hallucinated, center. The blurry JPEG gets blurrier every cycle, leaving more and more room for nonsensical and false material to fill the voids between the tokens.

真正的危险在于,这些扩展伪影会进入下一代生成式AI的训练数据。早期的「SolidGoldMagikarp」这类异常token能直接把模型搞崩溃,把内部逻辑完全暴露出来,但现在的新模型经过了严格的评估,你很难发现错误藏在哪里。训练分布里的长尾内容(小众的声音、新奇独特的表述、不同寻常的挑战性观点)会逐渐消失,每一代模型都会朝着同质化的、甚至可能是虚构的「中心」不断收敛。那张模糊的JPG每循环一次就会更糊一点,给无意义和虚假内容留下越来越多的空间,填补在token之间的空隙里。
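The fading long tail can be illustrated with a toy simulation. It assumes a "model" is nothing more than a table of token probabilities, and that each generation is trained only on a finite sample drawn from the previous one. This is not how real language models are trained, and every name and parameter here is hypothetical. The point is structural: once a rare token fails to appear in a sample, its estimated probability is zero and it can never come back.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Sample a finite corpus from `probs`, then re-estimate probabilities from it."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    corpus = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(corpus)
    return {t: counts[t] / sample_size for t in tokens}


def simulate(generations=20, vocab_size=50, sample_size=200, seed=0):
    """Return the number of tokens with non-zero probability at each generation."""
    rng = random.Random(seed)
    # Zipf-like start: a few common tokens and a long tail of rare ones.
    raw = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    probs = {f"tok{rank}": w / total for rank, w in enumerate(raw, start=1)}
    support = [sum(p > 0 for p in probs.values())]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        support.append(sum(p > 0 for p in probs.values()))
    return support


support_sizes = simulate()
```

The count of surviving tokens in `support_sizes` can only stay flat or shrink from one generation to the next, which mirrors the essay's claim that each cycle leaves less of the long tail to learn from.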

Compression made the information age possible by stripping things down to fit the pipes. Expansion made the AI age possible by blowing data back up again. Both operations leave marks; we've learned to spot compression artifacts, but we've only just begun to reckon with expansion artifacts. Until we do, there's a lot of risk to manage.

压缩技术把内容精简到能塞进传输管道,才让信息时代成为可能。扩展技术把数据重新放大,才让AI时代成为可能。这两种操作都会留下痕迹:我们已经学会了识别压缩伪影,但我们才刚开始要认真应对扩展伪影。在我们真正搞懂它之前,还有很多风险需要处理。


Source: https://mattstromawn.com/writing/expansion-artifacts/