AI is learning to lie, scheme, and threaten its creators

Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, ChatGPT creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models -- AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

L.Bartos--TPP