Claude Code · 2026-05-15 · unknown branch

Session Summary: Investigate feed-forward vs per-scene reconstruction open question

Main outcome: Investigated the inference-cost-vs-fidelity open question. Read the existing question doc, enumerated the LRM-family and SDS-on-splats families, and used `ask` for a synthesis pass. The researcher then pointed out that Instant3D had been dropped without reading, so the agent re-fetched it and re-ranked the feed-forward family with a discussion of whether the diffusion sampler counts as per-scene optimisation. The researcher then demanded a single-architecture ablation rather than a cross-paper benchmark; the agent designed a concrete LGM-as-warm-start ablation with three arms (frozen feed-forward, warm-start + N-step fine-tune sweep, per-scene from scratch) on a fixed 25-scene set.

Messages: 9
Tool calls: 11
Tokens: 0
Subagents: 0

Timeline & size

Project: Tesserae
Started: 2026-05-15T11:08:14Z
Ended: 2026-05-15T11:26:50Z
Model: claude-opus-4-7
Harness: claude-code
Raw transcript: examples/demo-corpus/.agent-sessions/2026-05-15-open-question/transcript.jsonl

Key decisions

None recorded.

Files touched

None recorded.

Commands run

None recorded.

Tools used

ask
lint_report
node_context
raw_source
search_nodes
wiki_page

Turn-by-turn conversation

Redacted user/assistant transcript turns, with assistant tool use collapsed under its response.

#1 · User · 2026-05-15T11:08:14Z

We have an open question about feed-forward vs per-scene reconstruction. What does the corpus say?

#2 · Assistant · 2026-05-15T11:08:42Z

thinking: Read the existing question doc, then enumerate the LRM-family papers, then use the ask tool for a synthesis pass.

Let me read the open-question doc first, then enumerate the feed-forward family, then run a synthesis question through ask.

Tool use (8)

#1 · raw_source · 2026-05-15T11:08:48Z

json
{
  "source_path": "data/research/questions/inference-cost-vs-fidelity.md"
}

#2 · search_nodes · 2026-05-15T11:09:30Z

json
{
  "limit": 8,
  "query": "Large Reconstruction Model feed-forward",
  "type": "Paper"
}

#3 · wiki_page · 2026-05-15T11:10:10Z

json
{
  "node_id": "Paper:arxiv-2311-04400"
}

#4 · wiki_page · 2026-05-15T11:11:00Z

json
{
  "node_id": "Paper:arxiv-2402-05054"
}

#5 · wiki_page · 2026-05-15T11:11:45Z

json
{
  "node_id": "Paper:arxiv-2403-02151"
}

#6 · search_nodes · 2026-05-15T11:12:40Z

json
{
  "limit": 5,
  "query": "DreamGaussian SDS on splats",
  "type": "Paper"
}

#7 · ask · 2026-05-15T11:13:30Z

json
{
  "backend": "auto",
  "question": "On the same input image, how does feed-forward 3D reconstruction (LRM/LGM/TripoSR) compare to per-scene SDS-on-splats (DreamGaussian) in fidelity, geometric accuracy, and wall-clock cost?",
  "top_k": 8
}

#8 · lint_report · 2026-05-15T11:15:00Z

json
{}

#3 · Assistant · 2026-05-15T11:15:30Z

thinking: No lint report yet. Move on. The answer plus the existing question doc is enough to propose an experiment design.

Lint report doesn't exist yet for this build. Here is what I have:

What the corpus already says

The open question doc (questions/inference-cost-vs-fidelity.md) calls out the four-axis gap: fidelity / geometric accuracy / wall-clock / GPU-hour amortised cost. No paper provides all four.
The feed-forward family is LRM (arxiv-2311-04400) -> Instant3D (arxiv-2311-06214) -> LGM (arxiv-2402-05054) -> TripoSR (arxiv-2403-02151). The three I read (LRM, LGM, TripoSR) all run in ~0.5-5 s/asset.
The per-scene SDS-on-splats family is DreamGaussian (arxiv-2309-16653) and AGG (arxiv-2401-04099, which is itself an amortised hybrid). DreamGaussian is ~2 minutes/asset.

New evidence the ask synthesis pulled together

Feed-forward methods are bandwidth-limited on thin structures because Objaverse under-represents them.
Per-scene SDS hallucinates back-side geometry where the critic has weak supervision.
Amortised cost breaks even at roughly 10k assets: above that, feed-forward is favoured; below it, DreamGaussian wins.
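
The break-even claim can be made concrete with a short calculation. This is a minimal sketch, assuming a linear cost model (a one-off training cost for the feed-forward model plus a fixed per-asset inference cost, against a fixed per-asset cost for per-scene optimisation). The per-asset times reuse the ~5 s and ~2 min figures quoted above; the training cost is a hypothetical placeholder chosen only so the example lands at 10k assets, not a number from the corpus.

```python
def break_even_assets(train_gpu_s, ff_s_per_asset, ps_s_per_asset):
    """Asset count above which feed-forward is cheaper in total GPU-seconds.

    Assumed linear cost models:
      feed-forward: train_gpu_s + n * ff_s_per_asset
      per-scene:    n * ps_s_per_asset
    """
    saving_per_asset = ps_s_per_asset - ff_s_per_asset
    if saving_per_asset <= 0:
        raise ValueError("per-scene must cost more per asset for a break-even to exist")
    return train_gpu_s / saving_per_asset


def amortised_cost(train_gpu_s, s_per_asset, n_assets):
    """GPU-seconds per asset once the training cost is spread over n_assets."""
    return train_gpu_s / n_assets + s_per_asset


# HYPOTHETICAL training cost (placeholder); per-asset times from the summary above.
TRAIN_GPU_S = 1_150_000.0
FF_S_PER_ASSET = 5.0      # ~5 s feed-forward
PS_S_PER_ASSET = 120.0    # ~2 min DreamGaussian

n_star = break_even_assets(TRAIN_GPU_S, FF_S_PER_ASSET, PS_S_PER_ASSET)
for n in (100, 1_000, 10_000, 100_000):
    ff = amortised_cost(TRAIN_GPU_S, FF_S_PER_ASSET, n)
    print(f"{n:>7} assets: feed-forward {ff:10.1f} s/asset vs per-scene {PS_S_PER_ASSET:.1f} s/asset")
```

Under this model the break-even point scales linearly with the training cost, so the real figure depends entirely on a training-cost estimate that the corpus would have to supply.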

Proposed experiment to resolve the question (extending the existing doc):

Inputs: 100 single images sampled stratified across (synthetic / GSO / in-the-wild) and (thin / blocky / organic) categories.
Methods: LRM, LGM, TripoSR (feed-forward) and DreamGaussian, AGG (per-scene/amortised).
Metrics:
(a) Held-out-view PSNR / SSIM / LPIPS on the 50 GSO captures (ground truth available).
(b) Chamfer + F-score against GSO meshes.
(c) Wall-clock inference on a single A100, single asset.
(d) Amortised GPU-hours per asset across 100, 1k, 10k, 100k asset volumes.
Stratify (a)+(b) by category to expose the thin-structure failure mode the synthesis predicts.

If any of {a, b, c, d} is missing, the comparison cannot be made (this corpus has roughly enough data to estimate three out of four). I'd extend the open-question doc with this design rather than rewrite it.
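
Metrics (a) and (b) can be pinned down with a small reference implementation. This is a sketch under stated assumptions: PSNR on flat pixel sequences in [0, 1], a symmetric Chamfer distance using mean unsquared nearest-neighbour distances, and an F-score at a distance threshold `tau`. Conventions differ between papers (squared vs unsquared distances, choice of `tau`), so the design doc would need to fix one. The brute-force nearest-neighbour search is only suitable for small point sets.

```python
import math


def psnr(img_a, img_b, max_val=1.0):
    """PSNR in dB between two equal-length flat pixel sequences."""
    if len(img_a) != len(img_b) or len(img_a) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def _nearest_dists(src, dst):
    """For each point in src, Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_and_fscore(pred, gt, tau):
    """Symmetric Chamfer distance and F-score at threshold tau.

    pred, gt: non-empty lists of 3D points sampled from each surface.
    """
    d_pred_to_gt = _nearest_dists(pred, gt)
    d_gt_to_pred = _nearest_dists(gt, pred)
    chamfer = sum(d_pred_to_gt) / len(pred) + sum(d_gt_to_pred) / len(gt)
    precision = sum(d <= tau for d in d_pred_to_gt) / len(pred)
    recall = sum(d <= tau for d in d_gt_to_pred) / len(gt)
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 2.0 * precision * recall / (precision + recall)
```

SSIM and LPIPS are omitted here because they need an image library and a pretrained network respectively.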

#4 · User · 2026-05-15T11:17:50Z

You searched for the LRM family and the search returned Instant3D too, but you never opened it. Pull it in. Where does it actually sit on the spectrum — is it really feed-forward, or is the diffusion stage hiding a per-scene optimisation in disguise?

#5 · Assistant · 2026-05-15T11:18:18Z

thinking: Right, I have Instant3D in the search results from turn 2 and silently dropped it because I assumed it was a text-to-3D paper, not a single-image reconstructor. But the user's question is sharper: Instant3D's pipeline is multi-view-diffusion -> LRM-style reconstructor. The diffusion stage is a forward sampler; the reconstructor is feed-forward. So end-to-end it's two feed-forward passes, but the diffusion sampler internally iterates. Whether that counts as 'feed-forward' depends on whether you count diffusion sampling as optimisation. Let me read the body.

Fair catch — I dropped it on a wrong assumption. Reading it now.

Tool use (2)

#1 · wiki_page · 2026-05-15T11:18:30Z

json
{
  "node_id": "Paper:arxiv-2311-06214"
}

#2 · node_context · 2026-05-15T11:19:45Z

json
{
  "limit": 12,
  "node_id": "Paper:arxiv-2311-06214"
}

#6 · Assistant · 2026-05-15T11:21:00Z

thinking: Instant3D belongs in the feed-forward column unambiguously. The 25-step DDIM is a sampler not per-scene optimisation against a photometric loss. LGM actually extends Instant3D (uses its 4-view diffusion output as input). Re-ranking: LRM (single-image -> triplane) -> Instant3D (text -> 4-view diffusion -> LRM-style triplane) -> LGM (uses Instant3D-style 4-view -> Gaussian field). All three are feed-forward. TripoSR is single-image -> triplane, same lineage as LRM. AGG sits between the two camps as an amortised hybrid.

Re-ranked family:

Pure feed-forward (single forward pass over network, no photometric loss at inference time):

LRM (arxiv-2311-04400) - single image -> triplane in ~5s.
TripoSR (arxiv-2403-02151) - LRM at production quality, ~0.5s.
Instant3D (arxiv-2311-06214) - text -> 4-view diffusion -> LRM-style triplane, ~20s end-to-end. The diffusion sampler IS iterative (25 DDIM steps) but it's not optimising a scene-specific loss; it's sampling a generative prior. That's the distinction.
LGM (arxiv-2402-05054) - uses Instant3D's 4-view output as input, emits a Gaussian field at higher resolution in ~5s.

Amortised hybrid:

AGG (arxiv-2401-04099) - amortised single-image Gaussian generation; sits between feed-forward (training-time only) and per-scene (lightweight fine-tune at inference).

Per-scene SDS-on-splats:

DreamGaussian (arxiv-2309-16653) - SDS loop on 3D Gaussians, ~2 min/asset.

Your distinction holds up: Instant3D's diffusion stage is iterative but is not per-scene optimisation, because the iteration is over a fixed generative prior, not a scene-specific photometric residual. The cost picture is less clean, though: if you have to re-run the diffusion sampler for every prompt, the amortised cost is closer to per-scene than the LRM-only numbers suggest.

#7 · User · 2026-05-15T11:24:10Z

OK. Now design the actual experiment that would answer this question. Not 'benchmark the five papers against each other' — they all use different training sets. I want a single ablation where the only changed variable is feed-forward vs per-scene on the same model class. What does that look like?

#8 · Assistant · 2026-05-15T11:24:42Z

thinking: The cleanest ablation isolates the training procedure, not the architecture. LRM and DreamGaussian have different architectures (transformer triplane vs Gaussian-soup SDS), different priors, and different training sets, so comparing them measures everything but the variable in question. The real ablation is: take ONE architecture (say, a Gaussian-field decoder) and run it three ways on the same scene set: (a) pure feed-forward with frozen weights, (b) feed-forward warm-start followed by N iterations of per-scene fine-tuning on the input-view photometric loss, (c) per-scene from scratch (random init or SfM-init Gaussian soup). Sweep N. The crossover point is the answer.

Single-architecture ablation. Anything else conflates the training distribution, the architecture, and the procedure.

Tool use (1)

#1 · raw_source · 2026-05-15T11:25:00Z

json
{
  "source_path": "data/research/questions/inference-cost-vs-fidelity.md"
}

#9 · Assistant · 2026-05-15T11:26:50Z

thinking: Concrete proposal. Use LGM as the fixed architecture (it's a Gaussian-field decoder, has public weights, and accepts 4-view input which is the natural input format). Three arms: (A) frozen LGM forward pass only, (B) LGM forward pass + N iterations of per-scene photometric optimisation of the predicted Gaussians, sweeping N in {0, 50, 200, 1000, 5000} steps, (C) per-scene from SfM-init Gaussian soup with no LGM warm-start, run to convergence (~30k steps for 3DGS recipe). Same test scenes (say 20 GSO captures with held-out views + Mip-NeRF 360 indoor for unbounded). Measure held-out PSNR/SSIM/LPIPS as a function of N. Plot the curve. The N at which arm B catches arm C tells you the warm-start value of feed-forward; the asymptote of B vs the value of A tells you how much per-scene fine-tuning adds; the slope between is the cost-fidelity exchange rate.

Concrete design — extending the open-question doc:

Fixed architecture. Use LGM (arxiv-2402-05054) — Gaussian-field decoder, public weights, accepts 4-view input. Same network in all three arms.

Same scene set. 20 GSO captures (ground-truth mesh available for Chamfer) + 5 unbounded indoor scenes from Mip-NeRF 360. ~25 scenes. All evaluated on the same held-out viewpoints.

Three arms, only the procedure varies:

Arm A: pure feed-forward. Run LGM once, freeze, render held-out views.
Arm B: warm-start + per-scene fine-tune. Run LGM once to get initial Gaussian set, then optimise photometric loss on the input views for N iterations. Sweep N in {0, 50, 200, 1000, 5000}. The N=0 setting recovers Arm A.
Arm C: per-scene from scratch. Discard LGM. SfM-seed a Gaussian soup as in original 3DGS. Optimise to convergence (~30k steps).
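
The three arms above can be enumerated as a run matrix. This is a minimal sketch, assuming one run per (scene, arm, step count); the `Run` record, the `build_runs` helper and the scene identifiers are hypothetical names for illustration, and the N=0 setting is labelled as Arm A rather than run twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    scene: str   # scene identifier (placeholder naming)
    arm: str     # "A", "B" or "C"
    steps: int   # per-scene optimisation steps at inference time


N_SWEEP = (0, 50, 200, 1000, 5000)   # Arm B sweep; N=0 recovers Arm A
SCRATCH_STEPS = 30_000               # approximate 3DGS convergence budget for Arm C


def build_runs(scenes):
    """One Arm A run, four Arm B runs and one Arm C run per scene."""
    runs = []
    for scene in scenes:
        for n in N_SWEEP:
            runs.append(Run(scene, "A" if n == 0 else "B", n))
        runs.append(Run(scene, "C", SCRATCH_STEPS))
    return runs


# Placeholder scene IDs: 20 GSO captures + 5 Mip-NeRF 360 scenes.
scenes = [f"gso_{i:02d}" for i in range(20)] + [f"mip360_{i}" for i in range(5)]
runs = build_runs(scenes)
print(len(runs), "runs across", len(scenes), "scenes")
```

Keeping Arms A and B in a single sweep makes the shared LGM forward pass explicit: only the number of fine-tune steps differs between them.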

Single metric of interest. Held-out-view PSNR as a function of total wall-clock seconds (Arm A: ~5 s; Arm B: ~5 s + N × ~0.04 s; Arm C: ~40 min). Plot the Pareto frontier.
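
The wall-clock axis follows directly from the estimates quoted above. This is a sketch that takes those approximate figures at face value (5 s forward pass, 0.04 s per fine-tune step, 40 min for Arm C); the constant and function names are illustrative.

```python
LGM_FORWARD_S = 5.0          # approximate single forward pass
FINETUNE_S_PER_STEP = 0.04   # approximate cost of one fine-tune iteration
ARM_C_S = 40 * 60            # approximate from-scratch budget, in seconds

N_SWEEP = (0, 50, 200, 1000, 5000)


def arm_b_seconds(n_steps):
    """Total wall-clock for Arm B: one forward pass plus n_steps of fine-tuning."""
    return LGM_FORWARD_S + n_steps * FINETUNE_S_PER_STEP


for n in N_SWEEP:
    t = arm_b_seconds(n)
    print(f"N={n:>5}: {t:7.1f} s  ({t / ARM_C_S:.1%} of the Arm C budget)")
```

With these figures the sweep covers roughly 5 s, 7 s, 13 s, 45 s and 205 s, so even the largest N stays under a tenth of the Arm C budget.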

What the curve answers.

The N at which Arm B's PSNR equals Arm C's converged PSNR = the warm-start value of feed-forward, in 'iterations saved'.
The gap between Arm A and Arm C at convergence = the irreducible feed-forward fidelity penalty for this architecture class.
The slope of Arm B between N=200 and N=1000 = the marginal cost-fidelity exchange rate at the working point.
Stratify by scene type (object-centric GSO vs unbounded Mip-NeRF 360) to expose the in-distribution / out-of-distribution gap LRM-family methods are known to have.
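
The three readouts listed above can be computed from the sweep with a few lines. This is a sketch, assuming per-N mean PSNR values for Arm B (with N=0 standing in for Arm A) and one converged PSNR for Arm C; the function names are illustrative, and the PSNR values in the usage section are synthetic placeholders, not measurements.

```python
def crossover_steps(arm_b_psnr, arm_c_psnr):
    """Smallest swept N at which Arm B reaches Arm C's converged PSNR, else None."""
    for n in sorted(arm_b_psnr):
        if arm_b_psnr[n] >= arm_c_psnr:
            return n
    return None


def feed_forward_penalty(arm_b_psnr, arm_c_psnr):
    """PSNR gap (dB) between converged Arm C and Arm A (the N=0 entry)."""
    return arm_c_psnr - arm_b_psnr[0]


def exchange_rate(arm_b_psnr, n_lo=200, n_hi=1000, s_per_step=0.04):
    """Marginal dB gained per extra second of fine-tuning between n_lo and n_hi."""
    extra_seconds = (n_hi - n_lo) * s_per_step
    return (arm_b_psnr[n_hi] - arm_b_psnr[n_lo]) / extra_seconds


# SYNTHETIC placeholder values for illustration only (not experimental results).
arm_b = {0: 24.0, 50: 25.0, 200: 27.0, 1000: 29.0, 5000: 30.5}
arm_c = 30.0

print("crossover N:", crossover_steps(arm_b, arm_c))
print("feed-forward penalty (dB):", feed_forward_penalty(arm_b, arm_c))
print("exchange rate (dB/s):", exchange_rate(arm_b))
```

Because the sweep is coarse, `crossover_steps` only brackets the true crossover between two swept values; a finer sweep around that bracket would tighten the estimate.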

What this doesn't answer. Cross-architecture comparison (LGM vs LRM vs TripoSR vs DreamGaussian). The ablation deliberately fixes the architecture to isolate the training-procedure variable. The cross-architecture question is a separate experiment.

I'd add this as a 'proposed resolution' section to questions/inference-cost-vs-fidelity.md.

Subagent sessions

No subagent transcripts attached.
