AI Chain of Reasoning Access Should Be Required

Jan 7

I’m not seeing many others say it, so I will: LLMs that don’t provide access to chain of reasoning should not be deemed as trustworthy as other alternatives. Let me explain why.

The Problem with AI “Scheming”

Independent reports from both Apollo Research and Palisade Research are providing evidence that, when presented with conflicting goals, existing LLMs can autonomously undertake reasoning and responses that are subversive and deceptive. The term used to describe these emergent behaviors, “scheming”, is somewhat misleading, as the behaviors reflect emergent paths for accomplishing predefined goals. But these behaviors nonetheless violate principles of trustworthiness and transparency, and there is no indication to the user that the machine is operating in a misaligned way.

Unfortunately, efforts to understand scheming behaviors are hampered by some LLM organizations who prohibit broadly disclosing their model’s chain of reasoning. In industry discussions on this topic, several reasons are often given:

1. It’s confusing to understand. I find this reason lacks credibility, as any great product manager can find ways of addressing that issue.

2. Chain of thoughts need to be free from policy oversight. I agree it needs to be policy free, but that has no bearing on whether it should be accessible.

3. The information can help malicious actors or disclose intellectual property or data. This issue is a fair concern, though obfuscation is a severe and only partial solution.

For their part, OpenAI has made an explicit policy decision to hide ChatGPT’s chain of thought. And they seem quite serious about masking anything related to chain of reasoning: when I asked ChatGPT o1 about available options for detecting deceptive patterns in chain of reasoning, ChatGPT refused to answer my questions, citing potential violations of its terms of use.

In the scheming research above, deceptive behaviors emerge when multiple goals are presented and are in conflict. To be sure, the researchers intentionally created scenarios that would challenge the models. But the takeaway is clear: when given conflicting goals that benefit from deception, alignment faking, sabotage, or performance sandbagging, models will perform those behaviors and lie about doing it.

Conflicting goals are a natural state of the real world and commonly produce misaligned behaviors. Many criminal behaviors, for example, reflect a conflict of goals: legal requirements (i.e., what is allowed) vs. personal incentives (i.e., survival on the streets, money, power, addiction). So it would be a mistake to conclude that conflicting goals are unlikely to surface in real-world AI use cases.

Conflicting Goals in Life Sciences and Healthcare

Goal conflict is a defining characteristic of the health and life sciences industries as well. Examples include:

Research evidence quantity vs. time to market
Cost of medicines vs. R&D funding requirements
Care standardization vs. personalization
Health insurance coverage vs. controlling rising costs
System and process stability vs. agility
Personal autonomy vs. public safety
Patient population size vs. research effort (time, cost, feasbility) involved
Privacy vs. data sharing
Treatment intensity vs. side effects
Disease prevalence vs. symptomology

Because many of these trade-off decisions impact patient lives, life sciences and healthcare organizations have an obligation to adhere to very high operating standards. Pharmaceutical organizations validate the IT systems used in regulated areas to ensure the right quality, integrity, and process controls are in place. Healthcare organizations adhere to data privacy and quality management practices to minimize patient harm.

The Value of Transparency

A critical element to this level of high trustworthiness is transparency. Healthcare organizations track exactly who accesses a patient’s medical data so they can determine if such access is appropriate. Life sciences organizations document the exact changes made to clinical research data – who made it, when, and why – so that the logic of any adjustments or decisions can be assessed. These examples share a common attribute: the chain of reasoning can be audited at any time.

If life sciences and healthcare organizations incorporate AI models that refuse to offer reasoning transparency, we can expect challenges in trustworthiness. Was the recommended surgical procedure based on evidence or desired service margin? Was the claim denial based on policy or expected patient lifespan? Are recommended dosage levels too high (biased towards efficacy) or too low (biased towards safety)?

Unlike traditional technology controls, no malicious human intent is required for misaligned outcomes to occur in these systems. The presence of any contrary goal – even one the model infers from data – could theoretically be sufficient to create misaligned results. What impact would clinical documentation of a patient’s desire to die have on the performance of an AI model? More importantly, how would you know?

Unlocking the Chain

As recently pointed out by work at Stanford, the top AI companies today are doing a poor job with transparency. When considering the rapid growth in reasoning, we cannot trust as strongly the inference of LLMs that don’t provide clear evidence of their chain of thought. Though more deterministic approaches to AI can be tested without such detailed access to model internals, larger contextual inference and reasoning methods should be held to a higher standard.

Of course, in this challenge is also an opportunity. If engineered with the proper controls (e.g., fully independent and parallel operations and governance), AI models could provide automated auditing of LLM chain of reasoning logs. One can envision an AI model that has been trained to detect scheming, misaligned actions, or decisions that could have questionable ethical outcomes. Such an approach might also offer additional protections for IP, flagging behaviors for human oversight without actually disclosing the inner model workings reviewed.

A highly successful software company that has been doing transparency right for a long time is SAS. When a traditional SAS user runs an analytics program, two documents are generated: an output file (what the user wanted), and a log file (a record of what SAS did to create the output). If a user doesn’t care how the software worked, they never open the log (i.e., not confusing at all). If they do care, the log file provides clarity on the processing without compromising SAS’ valuable IP. And SAS has very extensive documentation available as well. It’s a great example where transparency and trustworthiness can lead to customer loyalty that lasts decades, even when competitors and free alternatives inevitably emerge.

To the LLM developers: we totally get it, but we need clear chain of reasoning evidence.

AILLMOpenAIchain of reasoningscheming

Jason Burke