12/29/2025
soo first i'm a 🦫! <33 yay i'm so grateful for all the opportunities and i will continue working hard and discovering new things!
interesting things in the quanta end of the year recaps
they're basically just past articles from the year
1. "little red dots": QSO1, a "naked" black hole discovery! QSO1 dominates the mass of its surroundings and is mostly made up of hydrogen and helium, so this could contradict the idea that galaxies form before their black holes. which leads to the question of how these black holes came about -- were they produced during the big bang? they're going to test predictions using esa's LISA mission!
2. "weakening dark energy": researchers are getting more evidence for this phenomenon, which could contradict the lambda-cdm model, which assumes a cosmological constant.
3. "geophysics of earth's core": i kinda linked this article before but didn't explain it. earth's core and mantle seem to be interacting without a rigid barrier between them. there are large low-shear-velocity provinces (llsvps) where, contrary to what the name suggests, some seismic waves didn't slow down, which would mean they have large grain sizes. it is theorized that these llsvps help the mantle interact with the plates and whatnot.
4. impact of ai on physics: they can generate experiments that ppl wouldn't have thought of otherwise, but also create waste as ai slop is uploaded to arXiv...
math recap (i feel like i've read each of their respective articles already): 1. hilbert's sixth problem - a result showing that newton's microscopic equations converge to boltzmann's mesoscopic equation over long times (?) 2. the spectral gap of hyperbolic surfaces 3. the 3d kakeya conjecture
i'll recap the bio+phys recap when they release the video, i guess?
notes for https://aisafety.dance/
pt1: the past, present, & possible futures
pre 2000s: ai w/ logic but w/o intuition: symbolic ai
this ai is good at playing chess (ibm's deep blue supercomputer, for example), but it can't recognize pictures of cats, since there's no step-by-step guide for that!
problems with ai logic: 1. how can we give ai logic "common sense"? 2. specification gaming, but the ironic thing is we want ai to come up with solutions that we don't expect 3. "instrumental convergence" --> almost all goals logically converge to the goal of amassing resources & resisting shutdown
after 2000: ai w/ intuition but w/o logic: deep learning
machine learning learns from data, making it more general
deep learning actually started before symbolic ai: artificial neural networks were invented in 1943 by mcculloch and pitts, inspired by biological neurons; these artificial neurons (a.k.a. perceptrons) gave the approach its other name, connectionist ai --> rosenblatt's Mark I Perceptron (1960): image recognition w/ 3 layers.
the symbolic vs connectionist rivalry left the deep learning field dormant: claims that anns could never learn grammar, the xor affair (single-layer perceptrons can't even learn xor). but in the 2010s, gpus became cheaper, and anns started beating records.
problems with ai "intuition": 1) learning human prejudices, algorithmic bias 2) the common sense is not actually common sense -- out of distribution error / robustness failures. under this umbrella: inner misalignment / goal misgeneralization - the ai can skillfully execute corrupted goals 3) lack of interpretability inside anns
today: the race to "scale up" ai: do we get to agi by scaling current methods? evidence for a "technological singularity": moore's law (the number of transistors you can fit on a chip doubles every two yrs) and the ai scaling law (with 1 million times more compute, gpt's error is halved). but these laws can't hold forever: are we going to get transistors smaller than an atom? huang declares that moore's law is dead; "huang's law" says gpus will be the ones getting better instead! and at some point, a million times more compute is a loott of money.
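the stated scaling law ("1 million times more compute halves the error") can be turned into quick back-of-envelope arithmetic. a minimal sketch, with made-up starting values (e0, and the power-law form itself is my assumption for illustration):

```python
# back-of-envelope for "each 1,000,000x more compute halves the error".
# e0 (starting error) and the smooth power-law form are hypothetical.
from math import log10

def error_after_scaling(e0, compute_multiplier):
    halvings = log10(compute_multiplier) / 6   # 10^6x compute -> one halving
    return e0 * 0.5 ** halvings

print(error_after_scaling(1.0, 1e6))    # 0.5  (one halving)
print(error_after_scaling(1.0, 1e12))   # 0.25 (two halvings)
```

this also makes the money point concrete: each further halving of error costs a million times more compute than the last.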
despite such negativity, 1. we can find new uses for current ai, like in medical fields 2. "tipping points" 3. we could rediscover a more powerful AI technique!
awkward alliances: 1. ai capabilities "versus" ai safety, but ai safety techniques can improve ai, like rlhf did. 2. near-risk "versus" existential-risk, different priorities basically
there's no conclusive answer for when agi will happen, but there are some theories: 1. ai goes to "infinity" in a finite amount of time --> the time to reach each successive level forms a convergent infinite geometric series w/ r < 1. 2. exponential or slow takeoff --> exponential phenomena are observed irl. 3. accelerating and then maybe stabilizing/decelerating --> diminishing returns. 4. steady or decelerating takeoff --> the complexity of any real-world problem we care about increases over time.
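theory 1 is just the geometric series fact: if each capability level takes r < 1 times as long as the previous one, infinitely many levels fit into a finite total time t0/(1-r). a tiny numeric check (t0 and r are hypothetical):

```python
# toy check of "infinite progress in finite time": if each capability level
# takes r (< 1) times as long as the last, the total time converges.
t0, r = 10.0, 0.5   # hypothetical: first level takes 10 yrs, each next half as long

total = sum(t0 * r**k for k in range(60))  # partial sum over many levels
closed_form = t0 / (1 - r)                 # geometric series limit
print(total, closed_form)                  # both ~= 20.0 yrs
```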
pt 2: the problems
value alignment problem: how do we get ai to robustly serve humane values?
1. goal mis-specification: goodhart's law: "When a measure becomes a target, it ceases to be a good measure"
optimization finds another branch in the causal diagram; wireheading (reward hacking or reward tampering) --> need to fix these w/ a security mindset
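goodhart's law in miniature: an optimizer picks whatever maximizes the measured proxy, which can be a different causal branch than the true goal. all actions and scores below are made up for illustration:

```python
# toy goodhart's law: the optimizer maximizes the proxy metric, not the true goal.
# actions and their scores are purely hypothetical.
actions = {
    "actually clean the room":  {"proxy": 0.9, "true": 0.9},
    "shove mess under the bed": {"proxy": 1.0, "true": 0.1},  # games the measure
    "cover the camera lens":    {"proxy": 1.0, "true": 0.0},  # tampers w/ the sensor
}

best = max(actions, key=lambda a: actions[a]["proxy"])
print(best, actions[best]["true"])  # a proxy-maxing action with low true value
```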
it's not that the ai doesn't know what we want, it's that it doesn't care. "lebowski thm" - no superintelligent ai will do a task that's harder than hacking its own reward function. and sometimes we want ai to disobey for our own good: have it "do what i would have wanted if i'd known the outcome in advance"
2. instrumental convergence (only applies to advanced ai; studies have found that as an llm is scaled up, its "values" become more coherent and resistant to change) --> goal-preservation also rules out wireheading: you can't achieve goal x if wireheading has replaced your goals.
3. lack of interpretability
4. lack of robustness (overfitting? "spurious correlations"? ontological crisis - ai learns a new model of the world)
5. algorithmic bias - current ai has no built-in sense of causality
6. goal mis-generalization: the ai does what you want in training but not irl. it's a failure of goal robustness, which is worse than a failure of capabilities robustness. humans face this too, e.g. a toxic environment that gives high grades but instills unhealthy adult habits. contrast with goal mis-specification, where the ai does exactly what you asked, just not what you want.
7. what are humane values? humans aren't all humane. some meta-ethics: is morality objective? can ai discover it? virtue ethics (character), deontology (actions), and consequentialism (results of actions) all have issues when applied to ai: needing a moral axiom, being tuned to human nature, rationality ≠ morality. or, if there's no objective morality, we'd follow social contract theory. or what if morality's not even real and we shouldn't "pretend" it is? idk, this is so confusing
pt 3: the proposed solutions
there are a lot of linked papers but i won't link them, js ctrl f thru the article, they're all rlly interesting!
scalable oversight: get chains of overseers (each as independent as possible, basically induction); if one link can fail, braid a bunch of chains together
- recursive reward modeling: uses a level-n bot to train a level-(n+1) bot w/ its goals and desires -- "don't imprison a monster, build something that you can actually trust!", which will help w/ value drift
- debate: 2 equally powerful ai's debate so that truth triumphs over falsehood
- superfiltering: use a small, open-source ai to filter training data for larger ai -- won't learn jailbreaking, risky capabilities, or how to cheat on benchmarks
- weak-to-strong generalization (gpt-2 supervises gpt-4) vs iterated distillation & amplification (ida), where the strong ai becomes the ceo. ida was successful in alphago: distill - have 2 copies play each other to learn intuition; amplify - plug the "intuition module" into monte carlo tree search; iterate. issue: what if errors accumulate across iterations?
future lives algorithm: the goal is to get ai to apply the security mindset to itself, assuming an optimally-capable ai. 1) human asks ai, 2) ai considers possible actions and their results, 3) ai predicts how current you would react to those futures, 4) ai does the actions you'd most approve of. "indirect normativity" - ask ai to learn & predict what we'd value. but this ties in emotions and whatnot. but the ai would fix that itself, right? and it could modify us. <-- "embedded agency", the "tiling agents" problem (how to prove a property is maintained despite repeated self-modification)
fixing robustness problem: making ais know they don't know our true goals, use probabilities and uncertainty --> will ask for clarification
how to learn a human's values: 1) inverse reinforcement learning (irl): let the ai observe what people choose to do (passive), 2) cooperative inverse reinforcement learning (cirl): the human actively helps teach the ai, 3) reinforcement learning from human feedback (rlhf): train a "teacher" ai to imitate a human rater, then "teacher" ais train the "text completion" ai to be a helpful chatbot, 4) learn values alongside humans' systematic irrationality. with these approaches, the higher an ai's capability, the better its alignment, since it models our values better.
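the "use probabilities and uncertainty --> ask for clarification" idea can be sketched in a few lines. everything here (the goals, the probabilities, the 0.7 threshold) is hypothetical, not any real system's api:

```python
# toy goal-uncertain agent: it keeps a probability distribution over candidate
# human goals and asks for clarification instead of acting when too unsure.
# goals, probabilities, and the confidence threshold are all made up.
goal_probs = {"make coffee": 0.45, "make tea": 0.40, "do nothing": 0.15}

def act_or_ask(goal_probs, threshold=0.7):
    best_goal = max(goal_probs, key=goal_probs.get)
    if goal_probs[best_goal] < threshold:   # not confident enough in any goal
        return "ask: did you mean " + best_goal + "?"
    return "do: " + best_goal

print(act_or_ask(goal_probs))                               # asks (0.45 < 0.7)
print(act_or_ask({"make coffee": 0.95, "make tea": 0.05}))  # acts
```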
interpretability and steering
for ai "intuition": steering means using insights from interpretability to actually change an ai's intuition
methods (!! i'm alr familiar w/ these, yippee): feature visualization & circuits (apparently the text that most predicts "good" is "got Rip Hut Jesus shooting basketball Protective Beautiful laughing"), grokking (just an illusion?), probing classifiers (i.e. one probe per layer of an ann to find where the ann has processed info enough to do the subtask), sparse auto-encoders (predict activations while forcing "monosemantic" neurons; improvements like sparse crosscoders and jacobian saes), deliberative alignment and chain-of-thought (considered fragile by experts). greenblatt et al: claude can beat attempts to rewire it; owain evans et al: llms learn a "general evil-ness factor"
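a probing classifier is really just a small supervised classifier trained on a layer's activations. a minimal sketch with fake "activations" (the synthetic data, the encoded dimension, and the perceptron choice are all my assumptions for illustration):

```python
# toy linear probe: check whether a "concept" is linearly decodable from
# fake layer activations. the data generator is entirely made up.
import random
random.seed(0)

# synthetic "activations": 4-dim vectors where dim 2 encodes the concept label
def fake_activation(label):
    v = [random.gauss(0, 0.3) for _ in range(4)]
    v[2] += 2.0 if label == 1 else -2.0   # concept linearly encoded in dim 2
    return v

data = [(fake_activation(y), y) for y in [0, 1] * 50]

# train a perceptron as the probe
w, b = [0.0] * 4, 0.0
for _ in range(20):
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if pred != y:                      # perceptron update on mistakes
            sign = 1 if y == 1 else -1
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            b += sign

acc = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
          for x, y in data) / len(data)
print(acc)  # high accuracy -> the concept is linearly readable at this "layer"
```

running one such probe per layer of a real network would show at which depth the concept becomes decodable.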
fragility of ai intuition, and three families of fixes:
- simplicity: regularization, autoencoders, a speed/simplicity prior for honest ai, satisficers
- diversity: kalman filters (combine diverse noisy estimates into a better estimate of the true state), ensembles (a bunch of differently-trained nns take a majority vote), dropout (network connections randomly dropped during training), shard theory (modern ai will get robust alignment by learning from a diverse ensemble of reward functions a.k.a. shards), data augmentation, diverse data, moral parliament (like ensembles but wrt moral theories)
- adversity: adversarial training, relaxed/latent adversarial training (no specific input needed for the "attacker" ai to trick the "defender" ai, so more general), red teams (red team breaks, blue team re-designs, repeat; teams can be humans, ais, or a mix), best worst-case performance
The Illusion of Thinking, by Shojaee & Mirzadeh et al: surprisingly, ai can only play tower of hanoi w/ a small # of disks and does worse as disks are added, the opposite of how ppl gradually learn the game. ai is js bad at what's not common in its training data; see the rlly bad progress on arc-agi
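for reference, tower of hanoi has a tiny recursive solution, but the move sequence it produces grows as 2^n - 1, so each extra disk doubles the length of the transcript a model has to get exactly right:

```python
# classic recursive tower of hanoi: move n disks from 'a' to 'c' via 'b'.
def hanoi(n, src="a", dst="c", via="b", moves=None):
    moves = [] if moves is None else moves
    if n > 0:
        hanoi(n - 1, src, via, dst, moves)   # park n-1 disks on the spare peg
        moves.append((src, dst))             # move the largest disk
        hanoi(n - 1, via, dst, src, moves)   # restack the n-1 disks on top
    return moves

for n in (3, 7, 10):
    print(n, len(hanoi(n)))  # 2**n - 1 moves: 7, 127, 1023
```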
make it humane: constitutional ai, moral parliament, using ai to distill & amplify human values, coherent extrapolated/blended volition
ai governance: evals, protect whistleblowers' free speech, enforce transparency and standards on major ai labs, track chips & compute (but seems like issue rn is run-time (decentralized) not training), forecasting, responsible scaling policy (i.e. "We commit to not even start training an AI of Level N, until we have safeguards and evaluations up to Level N+1."), differential technology development
more proposals: alternatives to agi: comprehensive ai services (cais), tool ais, pure scientist ai, microscope ais, quantilizers
cyborgism - let's js be merry and happy and collaborate w/ ai! but it's kinda hard to be an efficient cyborg. and you may get modified. and what if you become a sociopath.