fn(args, deps) — Bringing Order to Chaos Without Breaking Anything
18 Mar 2026

There’s a major disconnect in AI-assisted development right now. Most of the conversation assumes you’re building something new, or working from the kind of clean, stable foundation that barely exists in real engineering teams.
The reality is that most engineering teams live in legacy systems under high load, with god classes, global singletons, and console.log as observability. The kind of code where every change is a gamble.
This post shows what happens when you apply fn(args, deps) and autotel to those codebases. fn(args, deps) creates the seam for safe change; production telemetry captures the behavioural record that survives when every other spec has decayed.

To prove the point, we’ll do this in plain JavaScript, not TypeScript.
Companion repo: fn-args-deps-bringing-order-to-chaos-without-breaking-anything. Every step is independently runnable.
Three Fields, Three Realities
Greenfield is what the demos show: net new code, explicit intent, and a spec you still trust.
Brownfield is where most teams actually work: an existing system, still evolving, where intent is only partially recoverable.
Blackfield is something else again: legacy systems under load, on a deprecation path everyone agrees on and nobody has time to execute.
This post starts in Blackfield. Features are still being bolted on because the business can't wait for the rewrite. Original authors are gone. Tests, if they exist, describe the system as it was. Documentation, if it exists, lies.
That isn't just a codebase with worse docs. It's a different engineering posture. You're not trying to improve something. You're trying not to break something while slowly making it possible to change. AI agents, trained to build forward and iterate toward green tests, are carrying the wrong assumptions into that room.
The Spec Problem Nobody Admits
Spec-driven development has a hidden premise: you know what your system is supposed to do.
In Greenfield, true. In Brownfield, partially true. In Blackfield, false.
Legacy systems accumulate behaviour nobody planned. Business rules encoded in conditionals that outlived everyone who understood them. When you ask an agent to work here without a spec, you haven't freed it from needing one. You've made the spec implicit, and the inferred spec will be wrong in ways that are hard to detect and expensive to discover.
The spec problem isn't solved by better prompts. It's solved by better sources of truth.
Here's what a typical legacy system looks like:
const { db } = require('./db.js');
const { paymentGateway } = require('./payment-gateway.js');
const { emailService } = require('./email-service.js');
const { inventoryService } = require('./inventory-service.js');
class OrderProcessor {
constructor() {
this.db = db;
this.payment = paymentGateway;
this.email = emailService;
this.inventory = inventoryService;
// Business rules hardcoded in the constructor.
// Someone added these in a hotfix. Nobody documented why.
this.discountRules = new Map([
['gold', 0.15],
['silver', 0.1],
['bronze', 0.05],
]);
this.taxRate = parseFloat(process.env.TAX_RATE || '0.08');
}
async processOrder(customerId, items) {
console.log(`[OrderProcessor] Processing order for ${customerId}`);
const startTime = Date.now();
const customer = await this.db.findCustomer(customerId);
if (!customer) throw new Error(`Customer not found: ${customerId}`);
const stockResult = await this.inventory.check(items);
if (!stockResult.available) throw new Error('Insufficient inventory');
const subtotal = items.reduce(
(sum, item) => sum + item.priceInCents * item.quantity,
0,
);
const discount = this.discountRules.get(customer.loyaltyTier) || 0;
const discountedTotal = Math.round(subtotal * (1 - discount));
const totalWithTax = Math.round(discountedTotal * (1 + this.taxRate));
const paymentResult = await this.payment.charge(customerId, totalWithTax);
if (!paymentResult.success)
throw new Error(`Payment failed: ${paymentResult.error}`);
await this.inventory.reserve(items);
const order = {
id: `ord_${Date.now()}`,
customerId,
items,
totalInCents: totalWithTax,
status: 'confirmed',
createdAt: new Date(),
};
const savedOrder = await this.db.saveOrder(order);
await this.email.send(
customer.email,
'Order Confirmed',
`Your order ${savedOrder.id} for $${(totalWithTax / 100).toFixed(2)} has been confirmed.`,
);
console.log(
`[OrderProcessor] Order ${savedOrder.id} completed in ${Date.now() - startTime}ms`,
);
return savedOrder;
}
}
const orderProcessor = new OrderProcessor();
module.exports = { OrderProcessor, orderProcessor };
Four dependencies wired to global singletons. console.log as observability. Business rules in a constructor added during a hotfix. require() at the top of the file, no way to inject alternatives.
An AI agent will see a class with methods. It'll mock the constructor, write tests that pass, and produce output that looks correct. But it can't see the invisible contracts: the payment gateway that rejects amounts over $10,000. The discount rules that were a regulatory requirement. The call ordering that matters because the inventory service has side effects.
The agent has the code but not the contract.
It cannot see which paths are load-bearing in production, which edge cases still matter to real users, or which invisible constraints came from compliance, operations, or business reality rather than code aesthetics.
In legacy systems, the sequence is not write a spec, generate code, run tests. It is observe production, extract behaviour, lock it with tests, then change code safely.
Characterization Tests as Archaeology
Michael Feathers' Working Effectively with Legacy Code introduced characterization tests, written not to verify correct behaviour, but to document current behaviour. You capture what the system does so that when you change something, you can see what moved.
Not a goalpost. A tripwire.
In Blackfield systems, system tests matter for the same reason characterization tests do: they preserve behaviour when intent is no longer trustworthy.
A black-box test anchored to user-visible outcomes often survives the decay that kills lower-level tests tied too closely to implementation.
For our order processor, characterization tests are ugly. They have to be. The code gives us no other choice:
jest.mock('../src/db.js', () => ({
db: {
findCustomer: jest.fn(),
checkStock: jest.fn(),
reserveStock: jest.fn(),
saveOrder: jest.fn(),
},
}));
jest.mock('../src/payment-gateway.js', () => ({
paymentGateway: { charge: jest.fn(), refund: jest.fn() },
}));
jest.mock('../src/email-service.js', () => ({
emailService: { send: jest.fn() },
}));
jest.mock('../src/inventory-service.js', () => ({
inventoryService: { check: jest.fn(), reserve: jest.fn() },
}));
Four jest.mock() calls before we've tested a single thing. Module-level mocking is the only way in.
This is exactly the sort of environment where an agent can look productive while misunderstanding the system. It can satisfy mocked tests without discovering the behavioural contracts hidden in call ordering, side effects, and production traffic.
But now we can lock the contract:
it('returns a confirmed order with correctly calculated total', async () => {
const order = await processor.processOrder(TEST_CUSTOMER_ID, TEST_ITEMS);
// Characterize the current pricing math:
// subtotal = (2 * 1000) + (1 * 2500) = 4500
// gold discount = 15% -> Math.round(4500 * 0.85) = 3825
// tax = 8% -> Math.round(3825 * 1.08) = 4131
expect(order.totalInCents).toBe(4131);
expect(order.status).toBe('confirmed');
});
it('throws when payment fails — and does NOT reserve inventory', async () => {
mockPayment.charge.mockResolvedValue({
success: false,
error: 'Amount exceeds limit',
});
await expect(
processor.processOrder(TEST_CUSTOMER_ID, TEST_ITEMS),
).rejects.toThrow('Payment failed: Amount exceeds limit');
expect(mockInventory.reserve).not.toHaveBeenCalled();
expect(mockDb.saveOrder).not.toHaveBeenCalled();
});
These are not the same as old tests that may have drifted. Old tests describe what the system was designed to do; characterization tests, written now against observed behaviour, describe what it actually does.
A green test suite on a Blackfield system is not always confidence. Sometimes it is a fossil record.
If the code has a bug in the pricing math, these tests enshrine that bug. We fix bugs after we have safe refactoring coverage.
The point isn't quality. The point is: don't make things worse.
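The totals those tests lock in are plain arithmetic, worth replaying outside the system before you trust them. The numbers below mirror the comments in the test above:

```javascript
// Replay the pricing math the characterization test enshrines.
const items = [
  { priceInCents: 1000, quantity: 2 }, // 2 x $10.00
  { priceInCents: 2500, quantity: 1 }, // 1 x $25.00
];
const subtotal = items.reduce((sum, i) => sum + i.priceInCents * i.quantity, 0); // 4500
const discounted = Math.round(subtotal * (1 - 0.15)); // gold discount -> 3825
const totalWithTax = Math.round(discounted * (1 + 0.08)); // 8% tax -> 4131
```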
The Seam: fn(args, deps = defaultDeps)
Now we can move. The strangler fig pattern starts here: extract one function at a time from the god class. Give it explicit dependencies. Provide defaults so existing callers keep working.
I’ll keep everything in CommonJS (require() / module.exports) because that’s what most legacy Node.js codebases actually look like. The seam is the point, not the module syntax.
The defaultDeps object imports the existing singletons, so the class can delegate to this function without any caller knowing:
const { db } = require('./db.js');
const { inventoryService } = require('./inventory-service.js');
// checkInventory, chargePayment, and sendConfirmation are the other
// extracted fn(args, deps) functions, required from their own modules.
const defaultDeps = {
findCustomer: (customerId) => db.findCustomer(customerId),
checkInventory: (items) => checkInventory({ items }),
chargePayment: (customerId, amount) =>
chargePayment({ customerId, amountInCents: amount }),
reserveStock: (items) => inventoryService.reserve(items),
saveOrder: (order) => db.saveOrder(order),
sendConfirmation: (email, orderId, totalInCents) =>
sendConfirmation({ email, orderId, totalInCents }),
discountRules: new Map([
['gold', 0.15],
['silver', 0.1],
['bronze', 0.05],
]),
taxRate: parseFloat(process.env.TAX_RATE || '0.08'),
};
async function processOrder(args, deps = defaultDeps) {
const { customerId, items } = args;
const customer = await deps.findCustomer(customerId);
if (!customer) throw new Error(`Customer not found: ${customerId}`);
const stockResult = await deps.checkInventory(items);
if (!stockResult.available) throw new Error('Insufficient inventory');
const subtotal = items.reduce(
(sum, item) => sum + item.priceInCents * item.quantity,
0,
);
const discount = deps.discountRules.get(customer.loyaltyTier) || 0;
const discountedTotal = Math.round(subtotal * (1 - discount));
const totalWithTax = Math.round(discountedTotal * (1 + deps.taxRate));
const paymentResult = await deps.chargePayment(customerId, totalWithTax);
if (!paymentResult.success)
throw new Error(`Payment failed: ${paymentResult.error}`);
await deps.reserveStock(items);
const order = {
id: `ord_${Date.now()}`,
customerId,
items,
totalInCents: totalWithTax,
status: 'confirmed',
createdAt: new Date(),
};
const savedOrder = await deps.saveOrder(order);
await deps.sendConfirmation(customer.email, savedOrder.id, totalWithTax);
return savedOrder;
}
module.exports = { processOrder };
The class can still exist and delegate. In most cases the public call sites don't need to change, though in real systems, watch for import timing and module side effects that can leak outward.
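A minimal sketch of that delegation, using a trivial stand-in for the extracted processOrder so the shape is runnable on its own:

```javascript
// Stand-in for the extracted fn(args, deps) function. In the real
// codebase this is the processOrder shown earlier, with defaultDeps.
async function processOrder(args, deps = {}) {
  return { id: 'ord_1', customerId: args.customerId, status: 'confirmed' };
}

// The god class survives as a thin shell over the seam.
class OrderProcessor {
  async processOrder(customerId, items) {
    // Old positional signature in, new fn(args, deps) out.
    return processOrder({ customerId, items });
  }
}
```

Callers of the old class API never see that the extraction happened.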
Now look at the tests:
function makeDeps(overrides = {}) {
return {
findCustomer: jest.fn().mockResolvedValue({
id: 'cust_1',
name: 'Alice',
email: 'alice@test.com',
loyaltyTier: 'gold',
}),
checkInventory: jest.fn().mockResolvedValue({ available: true }),
chargePayment: jest.fn().mockResolvedValue({
success: true,
transactionId: 'txn_123',
}),
reserveStock: jest.fn().mockResolvedValue(undefined),
saveOrder: jest.fn().mockImplementation(async (order) => order),
sendConfirmation: jest.fn().mockResolvedValue(undefined),
discountRules: new Map([
['gold', 0.15],
['silver', 0.1],
['bronze', 0.05],
]),
taxRate: 0.08,
...overrides,
};
}
it('processes a successful order with correct total', async () => {
const deps = makeDeps();
const items = [
{ productId: 'p1', name: 'Widget', quantity: 2, priceInCents: 1000 },
];
const order = await processOrder({ customerId: 'cust_1', items }, deps);
// subtotal = 2000, gold 15% => 1700, tax 8% => 1836
expect(order.totalInCents).toBe(1836);
expect(order.status).toBe('confirmed');
expect(deps.chargePayment).toHaveBeenCalledWith('cust_1', 1836);
});
it('throws on payment failure — stock is never reserved', async () => {
const deps = makeDeps({
chargePayment: jest.fn().mockResolvedValue({
success: false,
error: 'Card declined',
}),
});
await expect(
processOrder({ customerId: 'cust_1', items: sampleItems }, deps),
).rejects.toThrow('Payment failed: Card declined');
expect(deps.reserveStock).not.toHaveBeenCalled();
});
No jest.mock(). No constructor ceremony. You mock exactly what the function uses.
The = defaultDeps default is not laziness. It’s a strangler fig. Existing callers keep working. New callers and tests get clean injection. When every caller has migrated, you remove the default.
For the full pattern, see If You Only Enforce One Rule for AI Code, Make It fn(args, deps).
Production Telemetry Is the Spec That Survived
When everything else rots (the documentation, the tests, the original ticket), production telemetry is often the most reliable surviving record of behaviour. Users kept showing up. They kept logging in, hitting endpoints, triggering workflows.
fn(args, deps) is the extraction pattern that makes legacy code testable; tracing and structured telemetry are what make that behaviour legible enough for humans and agents to change safely.
Charity Majors has argued for years that production is where you find out how your system actually behaves. Her framing is forward-looking: instrument as you build, observe as you deploy. But the same logic applies backward, to systems that nobody instrumented properly when they were young.
Telemetry is not a normative spec. It cannot tell you whether behaviour is correct. It can only tell you what behaviour survived contact with production. Production behaviour can encode accidents as readily as intent. But in a legacy system where every other artefact has decayed, observed behaviour is often the safer place to start.
The deeper question is not just how to observe production, but how to make telemetry queryable as a spec — not as a dashboard, but as a source of truth for safe change. fn(args, deps) gives us the seam to start adding that telemetry to systems that never had it.
Wrapping With trace()
Now that our functions have the shape fn(args, deps), we can wrap them with autotel's trace(). The business logic does not change. Existing tests often pass unchanged.
const { trace } = require('autotel');
const processOrder = trace((ctx) => async (args, deps = defaultDeps) => {
ctx.setAttribute('order.customerId', args.customerId);
ctx.setAttribute('order.itemCount', args.items.length);
// ... exact same business logic ...
ctx.setAttribute('order.totalInCents', totalWithTax);
ctx.setAttribute('order.id', savedOrder.id);
ctx.setAttribute('customer.loyaltyTier', customer.loyaltyTier);
return savedOrder;
});
module.exports = { processOrder };
One-time setup in your entry point:
const { init } = require('autotel');
init({
service: 'order-api',
environment: process.env.NODE_ENV || 'development',
});
Nested traced functions create child spans automatically:
processOrder (span)
├── checkInventory (child span)
├── chargePayment (child span)
└── sendConfirmation (child span)
You wrapped it. You didn't rewrite it. The tests still pass. But now you can see it in production.
Wire at the Boundary
Once callers have been updated, remove the defaults and wire everything once in a composition root:
function createOrderService() {
const deps = {
findCustomer: (id) => db.findCustomer(id),
checkInventory: (args) =>
checkInventory(args, { checkStock: (items) => db.checkStock(items) }),
chargePayment: (args) =>
chargePayment(args, {
charge: (id, amount) => paymentGateway.charge(id, amount),
}),
reserveStock: (items) => db.reserveStock(items),
saveOrder: (order) => db.saveOrder(order),
sendConfirmation: (args) =>
sendConfirmation(args, {
sendEmail: (to, subject, body) => emailService.send(to, subject, body),
}),
discountRules: new Map([
['gold', 0.15],
['silver', 0.1],
['bronze', 0.05],
]),
taxRate: parseFloat(process.env.TAX_RATE || '0.08'),
};
return {
processOrder: (args) => processOrder(args, deps),
};
}
module.exports = { createOrderService };
The god class is gone. The singletons still exist in infra/ but they're wired at the boundary, not imported everywhere. The route handler stays clean:
const orderService = createOrderService();
app.post('/orders', async (req, res) => {
try {
const { customerId, items } = req.body;
const order = await orderService.processOrder({ customerId, items });
res.status(201).json(order);
} catch (error) {
res.status(500).json({ error: error.message });
}
});
The Spec You Can Query
Here’s the payoff. Four layers of observability, each using real tools:
Span attributes via autotel: business context attached to every trace:
ctx.setAttribute('order.customerId', args.customerId);
ctx.setAttribute('customer.loyaltyTier', customer.loyaltyTier);
ctx.setAttribute('order.totalInCents', totalInCents);
ctx.setAttribute('payment.transactionId', paymentResult.transactionId);
Structured logging via pino: JSON log lines with scoped context, injected as a dependency:
const log = logger.child({
operation: 'processOrder',
customerId: args.customerId,
});
log.info(
{ customerName: customer.name, loyaltyTier: customer.loyaltyTier },
'customer found',
);
log.info(
{ subtotalInCents, discountInCents, taxInCents, totalInCents },
'totals calculated',
);
log.info({ transactionId: paymentResult.transactionId }, 'payment charged');
log.info({ orderId: savedOrder.id, totalInCents }, 'order completed');
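The injection itself is library-agnostic. Here is a runnable sketch with a minimal pino-shaped stand-in (real pino provides the same child()/info() contract); the makeLogger helper and its write parameter are assumptions for illustration, not part of pino:

```javascript
// Minimal pino-shaped logger: JSON lines out, child() merges context.
function makeLogger(bindings = {}, write = console.log) {
  return {
    child: (extra) => makeLogger({ ...bindings, ...extra }, write),
    info: (fields, msg) =>
      write(JSON.stringify({ level: 'info', ...bindings, ...fields, msg })),
  };
}

// The logger rides in through deps like every other dependency.
const deps = { logger: makeLogger({ service: 'order-api' }) };
const log = deps.logger.child({ operation: 'processOrder', customerId: 'cust_1' });
log.info({ loyaltyTier: 'gold' }, 'customer found');
```

Because the logger is a dependency, tests can swap in a capturing stand-in and assert on the structured output instead of scraping stdout.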
Business event tracking via track(): analytics events for downstream consumers:
track('order.completed', {
orderId: savedOrder.id,
customerId: args.customerId,
totalInCents,
itemCount: args.items.length,
loyaltyTier: customer.loyaltyTier,
discountInCents,
});
Structured errors with OrderError: a plain JS class that tells callers why it failed and how to fix it:
class OrderError extends Error {
constructor(opts) {
super(opts.message);
this.name = 'OrderError';
this.status = opts.status;
this.code = opts.code;
this.why = opts.why;
this.fix = opts.fix;
}
}
throw new OrderError({
message: 'Payment failed',
status: 402,
code: 'PAYMENT_FAILED',
why: 'Payment processor declined',
fix: 'Try a different payment method',
});
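Where this pays off immediately is the HTTP boundary: the catch-all 500 in the earlier route handler can forward the error’s own status and guidance. A framework-free sketch; the toHttpResponse helper name is an assumption, not from the repo:

```javascript
class OrderError extends Error {
  constructor(opts) {
    super(opts.message);
    this.name = 'OrderError';
    this.status = opts.status;
    this.code = opts.code;
    this.why = opts.why;
    this.fix = opts.fix;
  }
}

// Translate any thrown error into a response shape. Unknown errors stay
// opaque 500s; an OrderError carries its own status, why, and fix.
function toHttpResponse(error) {
  if (error instanceof OrderError) {
    return {
      status: error.status,
      body: { code: error.code, error: error.message, why: error.why, fix: error.fix },
    };
  }
  return { status: 500, body: { error: 'Internal server error' } };
}
```

In the route handler’s catch block this becomes: const { status, body } = toHttpResponse(error); res.status(status).json(body);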
Together, traces, logs, events, and structured errors turn legacy behaviour from something implicit and anecdotal into something queryable.
Every log line is JSON. Every span has business attributes. Conceptually, once logs and trace attributes are structured, you can ask questions like:
-- Find all failed orders for gold customers
SELECT * FROM logs
WHERE loyaltyTier = 'gold'
AND error.code IS NOT NULL;
-- Average order value by loyalty tier
SELECT loyaltyTier, AVG(totalInCents)
FROM logs
WHERE status = 'confirmed'
GROUP BY loyaltyTier;
The Sequence
For legacy systems, the sequence isn't:
Write a spec → generate code → run tests
It's:
Observe production → extract behavioural contracts → encode as tests → use those as the spec → then bring the agent in
| Step | What Changes | Public API changes? | What You Get |
|---|---|---|---|
| 0 | Nothing. This is the mess | - | Starting point |
| 1 | Add characterization tests | None | Behavioural lock |
| 2 | Extract to fn(args, deps = defaultDeps) | Usually none | Testable seams |
| 3 | Wrap with trace() | None | Observability |
| 4 | Remove defaults, wire at boundary | Callers update at boundary | Clean architecture |
| 5 | Structured logging + actionable errors | Error shapes change | Observable system |
| 6 | UUID IDs, webhook notifier, config | None | Features on a clean codebase |
Steps 0–5 are the refactoring. Step 6 is the payoff: three features land safely because the architecture supports them. UUID order IDs (a one-line change), a webhook notifier (a new domain function plus a feature flag in the composition root), and centralized config. None of them touch existing domain code. None of them break existing tests.
That's the whole point. You don't refactor for the sake of refactoring. You refactor so that the next feature is easy.
The tools already exist: characterization tests, strangler figs, contract testing, observability. The legacy code community discovered them before AI entered the picture. The point is not to invent a new discipline for agents. It is to carry forward the disciplines that made legacy change survivable in the first place.
We hand agents the codebase, the tests, the docs, the specs, the commit history. But in legacy systems, the most trustworthy surviving artefact is often runtime behaviour itself.
Capture that behaviour as structured, queryable context, and the system becomes legible again — to humans first, and then to agents.