knowledge/malik/extropic.json

{
    "EXTROPIC PAPER": {
    "author_metadata": "Extropic, a company focused on developing innovative technologies for sustainable computing.",
    "source_metadata": "Extropic Corporation research paper",
    "knowledge": [
      {
        "type": "fact",
        "insight": "The paper proposes an all-transistor probabilistic computer architecture called Denoising Thermodynamic Models (DTMs) that could achieve performance parity with GPUs while using approximately 10,000 times less energy on simple image benchmarks.",
        "content": "Asystem-levelanalysisindicatesthatdevicesbased on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.",
        "attributes": [
          {
            "name": "source",
            "value": "Extropic Corporation research paper"
          },
          {
            "name": "date",
            "value": "October 29, 2025"
          },
          {
            "name": "energy_efficiency_claim",
            "value": "10,000x improvement"
          },
          {
            "name": "benchmark_type",
            "value": "simple image benchmark"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "U.S. AI data centers could consume 10% of all U.S. energy production by 2030, and current annual spending exceeds the inflation-adjusted cost of the Apollo program.",
        "content": "Every year, U.S. firms spend an amount larger than the inflation-adjusted cost of the Apollo program on AI-focused data centers [1, 2]. By 2030, these data centers could consume 10% of all of the energy produced in the U.S. [3].",
        "attributes": [
          {
            "name": "source",
            "value": "U.S. energy consumption projections"
          },
          {
            "name": "date",
            "value": "2025 projection"
          },
          {
            "name": "energy_consumption",
            "value": "10% of U.S. energy by 2030"
          },
          {
            "name": "annual_spending",
            "value": "Exceeds Apollo program cost"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Current AI algorithms have evolved in a suboptimal direction due to the 'Hardware Lottery' problem, where hardware availability has influenced algorithm research.",
        "content": "Had a different style of hardware been popular in the last few decades, AI algorithms would have evolved in a completely different direction, and possibly a more energy-efficient one. This interplay between algorithm research and hardware availability is known as the 'Hardware Lottery' [19], and it entrenches hardware-algorithm pairings that may be far from optimal.",
        "attributes": [
          {
            "name": "source",
            "value": "Research analysis"
          },
          {
            "name": "problem",
            "value": "Hardware Lottery"
          },
          {
            "name": "consequence",
            "value": "Suboptimal algorithm evolution"
          },
          {
            "name": "sentiment",
            "value": "Critical"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Existing probabilistic computer approaches for EBMs suffer from the 'mixing-expressivity tradeoff' (MET), where better modeling performance leads to exponentially longer mixing times and higher energy costs.",
        "content": "The mixing-expressivity tradeoff (MET) summarizes this issue with existing probabilistic computer architectures, reflecting the fact that modeling performance and sampling hardness are coupled for MEBMs. Specifically, as the expressivity (modeling performance) of an MEBM increases, its mixing time (the amount of computational effort needed to draw independent samples from the MEBM's distribution) becomes progressively longer, resulting in expensive inference and unstable training [52, 53].",
        "attributes": [
          {
            "name": "problem",
            "value": "Mixing-Expressivity Tradeoff (MET)"
          },
          {
            "name": "consequence",
            "value": "Exponentially longer mixing times"
          },
          {
            "name": "impact",
            "value": "Expensive inference and unstable training"
          },
          {
            "name": "solution_type",
            "value": "Addressed by DTM approach"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The DTCA uses all-transistor hardware with subthreshold transistor dynamics for random number generation, avoiding exotic components and enabling CMOS scalability.",
        "content": "To enable a near-term, large-scale realization of the DTCA, we leveraged the shot-noise dynamics of subthreshold transistors [45] to build an RNG that is fast, energy-efficient, and small. Our all-transistor RNG is programmable and has the desired sigmoidal response to a control voltage, as shown by experimental measurements in Fig. 4 (a).",
        "attributes": [
          {
            "name": "hardware_approach",
            "value": "All-transistor implementation"
          },
          {
            "name": "technology",
            "value": "Subthreshold transistor dynamics"
          },
          {
            "name": "scalability",
            "value": "CMOS compatible"
          },
          {
            "name": "component_type",
            "value": "RNG (Random Number Generator)"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "DTMs represent the first scalable method for applying probabilistic hardware to machine learning by chaining multiple EBMs to gradually build complexity.",
        "content": "At the top level, we introduce a new probabilistic computer architecture that runs Denoising Thermodynamic Models (DTMs) instead of monolithic EBMs. As their name suggests, rather than using the hardware's EBM to model data distributions directly, DTMs sequentially compose many hardware EBMs to model a process that denoises the data gradually. Diffusion models [18, 44] also follow this denoising procedure and are much more capable than EBMs. This key architectural change addresses a fundamental issue with previous approaches and represents the first scalable method for applying probabilistic hardware to machine learning.",
        "attributes": [
          {
            "name": "innovation",
            "value": "First scalable probabilistic hardware approach"
          },
          {
            "name": "key_change",
            "value": "Sequential EBM composition"
          },
          {
            "name": "benefit",
            "value": "Avoids mixing-expressivity tradeoff"
          },
          {
            "name": "significance",
            "value": "Fundamental architectural breakthrough"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The research specifically uses sparse Boltzmann machines (Ising models) as the EBM implementation due to their hardware efficiency and simple Gibbs sampling update rules.",
        "content": "The DTM that produced the results shown in Fig. 1 used Boltzmann machine EBMs. Boltzmann machines, also known as Ising models in physics, use binary random variables and are the simplest type of discrete-variable EBM.\n\nBoltzmann machines are hardware efficient because the Gibbs sampling update rule required to sample from them is simple. Boltzmann machines implement energy functions of the form\nE(x) =−β \n⟨\n∑_{i≠j} x_i J_ij x_j + ∑_{i=1}^n h_i x_i\n⟩\n,(10)",
        "attributes": [
          {
            "name": "implementation",
            "value": "Sparse Boltzmann machines"
          },
          {
            "name": "model_type",
            "value": "Ising models/EBMs"
          },
          {
            "name": "efficiency_reason",
            "value": "Simple Gibbs sampling"
          },
          {
            "name": "variable_type",
            "value": "Binary random variables"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "GPU performance per joule doubles every few years, making it extremely difficult for new computing schemes to achieve mainstream adoption despite theoretical advantages.",
        "content": "In addition to these integration challenges, GPU performance per joule is doubling every few years [24], making it very difficult for cutting-edge computing schemes to gain mainstream adoption.",
        "attributes": [
          {
            "name": "challenge",
            "value": "GPU efficiency improvement rate"
          },
          {
            "name": "rate",
            "value": "Doubling every few years"
          },
          {
            "name": "impact",
            "value": "Barriers to new adoption"
          },
          {
            "name": "context",
            "value": "Competitive landscape analysis"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The DTCA architecture can be implemented in various modular configurations, including distinct physical circuitry on the same chip, across multiple communicating chips, or reprogrammed hardware with different weights.",
        "content": "The modular nature of DTMs enables various hardware implementations. For example, each EBM in the chain can be implemented using distinct physical circuitry on the same chip, as shown in Fig. 3 (b). Alternatively, the various EBMs may be split across several communicating chips or implemented by the same hardware, reprogrammed with distinct sets of weights at different times.",
        "attributes": [
          {
            "name": "architecture",
            "value": "Modular DTCA"
          },
          {
            "name": "implementation_options",
            "value": "Multiple chip configurations"
          },
          {
            "name": "flexibility",
            "value": "Reprogrammable hardware"
          },
          {
            "name": "scalability",
            "value": "Various deployment options"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The mixing-expressivity tradeoff (MET) summarizes the coupling between modeling performance and sampling hardness for MEBMs, where increased expressivity leads to longer mixing times and more expensive inference.",
        "content": "The mixing-expressivity tradeoff (MET) summarizes this issue with existing probabilistic computer architectures, reflecting the fact that modeling performance and sampling hardness are coupled for MEBMs. Specifically, as the expressivity (modeling performance) of an MEBM increases, its mixing time (the amount of computational effort needed to draw independent samples from the MEBM's distribution) becomes progressively longer, resulting in expensive inference and unstable training [52, 53].",
        "attributes": [
          {
            "name": "source",
            "value": "Academic paper text"
          },
          {
            "name": "reference",
            "value": "[52, 53]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTMs (Denoising Thermodynamic Models) merge EBMs with diffusion models to provide an alternative probabilistic computing approach that addresses the MET.",
        "content": "DTMs merge EBMs with diffusion models, offering an alternative path for probabilistic computing that assuages the MET. DTMs are a slight generalization of recent work from deep learning practitioners that has pushed the frontier of EBM performance [57–60].",
        "attributes": [
          {
            "name": "source",
            "value": "Academic paper text"
          },
          {
            "name": "reference",
            "value": "[57–60]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The forward process in denoising diffusion models follows a Markov chain structure and has a unique stationary distribution with a simple form.",
        "content": "Denoising models attempt to reverse a process that gradually transforms the data distribution Q(x0) into simple noise. This forward process is given by the Markov chain Q(x0, . . . , xT ) = Q(x0) ∏t=1T Q(xt|xt−1). (3) The forward process is typically chosen such that it has a unique stationary distribution Q(xT ), which takes a simple form (e.g., Gaussian or uniform).",
        "attributes": [
          {
            "name": "source",
            "value": "Academic paper text"
          },
          {
            "name": "equation",
            "value": "Eq. (3)"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "MEBMs have a fundamental flaw that makes them energetically costly to scale at the probabilistic computing level.",
        "content": "The MET makes it clear that MEBMs have a flaw that makes them challenging and energetically costly to scale.",
        "attributes": [
          {
            "name": "source",
            "value": "Author's assessment"
          },
          {
            "name": "perspective",
            "value": "Critical"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "DTMs successfully overcome the mixing-expressivity tradeoff by using a gradual complexity building approach.",
        "content": "Instead of trying to use a single EBM to model the data, DTMs chain many EBMs to gradually build up to the complexity of the data distribution. This gradual buildup of complexity allows the landscape of each EBM in the chain to remain relatively simple (and easy to sample) without limiting the complexity of the distribution modeled by the chain as a whole;",
        "attributes": [
          {
            "name": "source",
            "value": "Author's technical assessment"
          },
          {
            "name": "perspective",
            "value": "Positive/optimistic"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The mixing-expressivity tradeoff creates energy barriers between data modes that make sampling computationally expensive.",
        "content": "For large differences in energy, like those encountered when trying to move between two valleys separated by a significant barrier, this probability can be very close to zero. These barriers grind the iterative sampler to a halt.",
        "attributes": [
          {
            "name": "source",
            "value": "Technical description"
          },
          {
            "name": "metaphor",
            "value": "Valley/barrier analogy"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Denoising diffusion models work by reversing a gradual noise addition process to generate data from simple noise.",
        "content": "Reversal of the forward process is achieved by learning a set of distributions Pθ(xt−1|xt) that approximate the reversal of each conditional in Eq. (3). In doing so, we learn a map from simple noise to the data distribution, which can then be used to generate new data.",
        "attributes": [
          {
            "name": "source",
            "value": "Methodological description"
          },
          {
            "name": "process",
            "value": "Reverse diffusion"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Traditional diffusion models use sufficiently fine-grained forward processes for their implementation.",
        "content": "In traditional diffusion models, the forward process is made to be sufficiently fine-grained (using a large num",
        "attributes": [
          {
            "name": "source",
            "value": "Comparative statement about traditional approaches"
          },
          {
            "name": "completeness",
            "value": "Incomplete sentence"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Traditional diffusion models use a neural network parameterized conditional distribution (Gaussian or categorical) to approximate the reverse process, minimizing KL divergence between joint distributions Q and Pθ",
        "content": "ber of stepsT) such that the conditional distribution of each step in the reverse process takes some simple form (such as Gaussian or categorical). This simple distribution is parameterized by a neural network, which is then trained to minimize the Kullback-Leibler (KL) divergence between the joint distributionsQandP θ, LDN (θ) =D Q(x0, . . . , xT ) Pθ(x0, . . . , xT ) ,(4)",
        "attributes": [
          {
            "name": "method",
            "value": "traditional diffusion models"
          },
          {
            "name": "objective",
            "value": "minimize KL divergence"
          },
          {
            "name": "section",
            "value": "model training"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The joint distribution of diffusion models is the product of learned conditionals: Pθ(x0, ..., xT) = Q(xT) ∏_{t=1}^T Pθ(xt-1|xt)",
        "content": "where the joint distribution of the model is the product of the learned conditionals: Pθ(x0, . . . , xT ) =Q(x T ) TY t=1 Pθ(xt−1|xt).(5)",
        "attributes": [
          {
            "name": "equation",
            "value": "(5)"
          },
          {
            "name": "section",
            "value": "joint distribution"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Energy-Based Models re-cast the forward process in exponential form: Q(xt|xt-1) ∝ e^(-Ef_t-1(xt-1,xt))",
        "content": "In many cases, it is straight- forward to re-cast the forward process in an exponential form, Q(xt|xt−1)∝e −Ef t−1(xt−1,xt),(6)",
        "attributes": [
          {
            "name": "method",
            "value": "EBM re-casting"
          },
          {
            "name": "equation",
            "value": "(6)"
          },
          {
            "name": "section",
            "value": "EBM approach"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTMs generalize EBM approach by introducing latent variables {zt}, allowing independent scaling of model size/complexity from data dimension",
        "content": "To maximally leverage probabilistic hardware for EBM sampling, DTMs generalize Eq. (7) by introducing latent variables{z t}: Pθ(xt−1|xt)∝ X zt−1 e−(Ef t−1(xt−1,xt)+Eθ t−1(xt−1,zt−1,θ)). (8) Introducing latent variables allows the size and complexity of the probabilistic model to be increased independently of the data dimension.",
        "attributes": [
          {
            "name": "method",
            "value": "DTM generalization"
          },
          {
            "name": "equation",
            "value": "(8)"
          },
          {
            "name": "innovation",
            "value": "latent variables"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTMs have the property that exact reverse-process conditional approximation also learns marginal distribution at t-1: Q(xt-1) ∝ ∑_{zt-1} e^(-Eθ_t-1(xt-1,zt-1,θ))",
        "content": "A convenient property of DTMs is that if the ap- proximation to the reverse-process conditional is exact (Pθ(xt−1|xt)→Q(x t−1|xt)), one also learns the marginal distribution att−1, Q(xt−1)∝ X zt−1 e−Eθ t−1(xt−1,zt−1,θ).(9)",
        "attributes": [
          {
            "name": "property",
            "value": "marginal learning"
          },
          {
            "name": "equation",
            "value": "(9)"
          },
          {
            "name": "condition",
            "value": "exact reverse-process approximation"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Increasing T while holding EBM architecture constant increases expressive power and makes each sampling step easier, bypassing MET constraints",
        "content": "As the number of steps in the forward process is increased, the effect of each noising step becomes smaller, meaning that Ef t−1 more tightly bindsx t tox t−1. This binding can simplify the distribution given in Eq. (7)... As illustrated in Fig. 3 (a), models of the form given in Eq. (7) reshape simple noise into an approximation of the data distribution. IncreasingTwhile holding the EBM architecture constant simultaneously increases the expressive power of the chain and makes each step easier to sample from, entirely bypassing the MET.",
        "attributes": [
          {
            "name": "benefit",
            "value": "increased expressive power"
          },
          {
            "name": "advantage",
            "value": "easier sampling"
          },
          {
            "name": "figure",
            "value": "Fig. 3(a)"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "DTCA architecture tightly integrates DTMs into probabilistic hardware for highly efficient implementation, with each EBM implemented by distinct circuitry for input/output conditioning and latent sampling",
        "content": "The Denoising Thermodynamic Computer Architec- ture (DTCA) tightly integrates DTMs into probabilistic hardware, allowing for the highly efficient implementa- (b)A sketch of how a chip based on the DTCA chains hard- ware EBMs to approximate the reverse process. Each EBM is implemented by distinct circuitry, parts of which are dedicated to receiving the inputs and conditionally sampling the outputs and latents.",
        "attributes": [
          {
            "name": "architecture",
            "value": "DTCA"
          },
          {
            "name": "implementation",
            "value": "probabilistic hardware"
          },
          {
            "name": "hardware_component",
            "value": "distinct circuitry per EBM"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The DTCA architecture uses constrained Energy-Based Models (EBMs) with sparse and local connectivity that can be implemented using massively parallel arrays of primitive circuitry performing Gibbs sampling.",
        "content": "Practical implementations of the DTCA utilize natural-to-implement EBMs that exhibit sparse and local connectivity, as is typical in the literature [33]. This constraint allows sampling of the EBM to be performed by massively parallel arrays of primitive circuitry that implement Gibbs sampling.",
        "attributes": [
          {
            "name": "source",
            "value": "DTCA architecture description"
          },
          {
            "name": "technical_approach",
            "value": "hardware implementation"
          },
          {
            "name": "reference",
            "value": "[33]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The reverse process transformation Ef t−1 can be implemented efficiently using pairwise interactions between variables in xt and xt−1, with no constraints on the form of Eθ t−1.",
        "content": "A key feature of the DTCA is thatEf t−1 can be implemented efficiently using our constrained EBMs. Specifically, for both continuous and discrete diffusion,E f t−1 can be implemented using a single pairwise interaction between corresponding variables inxt andx t−1; see Ap- pendix A.1 and C.1 for details. This structure can be reflected in how the chip is laid out to implement these interactions without violating locality constraints. Critically, Eq. (8) places no constraints on the form ofE θ t−1. Therefore, we are free to use EBMs that our hardware implements especially efficiently.",
        "attributes": [
          {
            "name": "technical_approach",
            "value": "algorithm efficiency"
          },
          {
            "name": "constraint_type",
            "value": "pairwise interactions"
          },
          {
            "name": "reference",
            "value": "Eq. (8)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The DTM performance was tested on Fashion-MNIST dataset using GPU simulation with FID metrics and energy consumption estimates, compared against conventional VAE on GPU.",
        "content": "To understand the performance of a future hardware device, we developed a GPU simulator of the DTCA and used it to train a DTM on the Fashion-MNIST dataset. We measure the performance of the DTM using FID and utilize a physical model to estimate the energy required to generate new images. These numbers can be compared to conventional algorithm/hardware pairings, such as a VAE running on a GPU; these results are shown in Fig. 1.",
        "attributes": [
          {
            "name": "evaluation_method",
            "value": "GPU simulation"
          },
          {
            "name": "dataset",
            "value": "Fashion-MNIST"
          },
          {
            "name": "metrics",
            "value": "FID, energy consumption"
          },
          {
            "name": "comparison_baselines",
            "value": "VAE on GPU"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Boltzmann machines with binary random variables were used as the simplest type of discrete-variable EBM, implementing energy functions with specific mathematical form E(x) = −β(Σi≠j xiJijxj + Σi hi xi).",
        "content": "The DTM that produced the results shown in Fig. 1 used Boltzmann machine EBMs. Boltzmann machines, also known as Ising models in physics, use binary random variables and are the simplest type of discrete-variable EBM. Boltzmann machines implement energy functions of the form E(x) =−β 〈X i̸=j xiJijxj + X i=1 hixi 〉,(10) where eachx i ∈ {−1,1}.",
        "attributes": [
          {
            "name": "model_type",
            "value": "Boltzmann machine"
          },
          {
            "name": "variable_type",
            "value": "binary random variables"
          },
          {
            "name": "alternative_name",
            "value": "Ising models"
          },
          {
            "name": "equation_reference",
            "value": "Eq. (10)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The Gibbs sampling update rule for Boltzmann machines follows a sigmoidal probability function P(Xi[k+1] = +1|X[k] = x) = σ(2β(Σj≠i Jij xj + hi)), which can be implemented using simple circuitry with appropriately biased random bits.",
        "content": "The Gibbs sampling update rule for sampling from the corresponding EBM is P(X i[k+ 1] = +1|X[k] =x) =σ 〈2β 〈X j̸=i Jij xj+hi 〉〉, (11) which can be evaluated simply using an appropriately biased source of random bits.",
        "attributes": [
          {
            "name": "algorithm",
            "value": "Gibbs sampling"
          },
          {
            "name": "probability_function",
            "value": "sigmoidal"
          },
          {
            "name": "implementation_medium",
            "value": "random bits"
          },
          {
            "name": "equation_reference",
            "value": "Eq. (11)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The Boltzmann machines were implemented as sparse, deep models with L×L grids (L=70), where each variable connects to several neighbors (typically 12), following bipartite connectivity patterns for parallel sampling.",
        "content": "Specifically, the EBMs employed in this work were sparse, deep Boltzmann machines comprisingL×Lgrids of binary variables, whereL= 70was used in most cases. Eachvariablewasconnectedtoseveral(inmostcases, 12) of its neighbors following a simple pattern. At random, some of the variables were selected to represent the data xt−1, and the rest were assigned to the latent variables zt−1. Then, an extra node was connected to each data node to implement the coupling toxt.",
        "attributes": [
          {
            "name": "architecture",
            "value": "sparse, deep Boltzmann machines"
          },
          {
            "name": "grid_size",
            "value": "L×L, L=70"
          },
          {
            "name": "connectivity",
            "value": "12 neighbors typically"
          },
          {
            "name": "variable_types",
            "value": "data nodes, latent variables"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The shot-noise dynamics of subthreshold transistors were used to build an RNG that is fast, energy-efficient, and small, with experimental results showing sigmoidal response to control voltage and approximately exponential autocorrelation decaying in ~100ns.",
        "content": "To enable a near-term, large-scale realization of the DTCA, we leveraged the shot-noise dynamics of sub- threshold transistors [45] to build an RNG that is fast, energy-efficient, and small. Our all-transistor RNG is programmable and has the desired sigmoidal response to a control voltage, as shown by experimental measurements in Fig. 4 (a). The stochastic voltage signal output from the RNG has an approximately exponential autocorrelation function that decays in around100ns, as il- lustrated in Fig. 4 (b).",
        "attributes": [
          {
            "name": "hardware_component",
            "value": "RNG"
          },
          {
            "name": "implementation_technology",
            "value": "subthreshold transistors"
          },
          {
            "name": "properties",
            "value": "fast, energy-efficient, small, programmable"
          },
          {
            "name": "response_characteristic",
            "value": "sigmoidal"
          },
          {
            "name": "correlation_time",
            "value": "~100ns"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The modular nature of DTMs enables flexible hardware implementations including distinct physical circuitry per EBM, split across communicating chips, or reprogrammable hardware with different weights at different times.",
        "content": "The modular nature of DTMs enables various hardware implementations. For example, each EBM in the chain can be implemented using distinct physical circuitry on the same chip, as shown in Fig. 3 (b). Alternatively, the various EBMs may be split across several communicating chips or implemented by the same hardware, reprogrammed with distinct sets of weights at different times.",
        "attributes": [
          {
            "name": "design_approach",
            "value": "modular architecture"
          },
          {
            "name": "flexibility",
            "value": "multiple implementation options"
          },
          {
            "name": "hardware_options",
            "value": "distinct circuits, split chips, reprogrammable"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "A practical advantage of the all-transistor RNG design is that detailed foundry-provided models can be used to study manufacturing variations, enabling systematic design optimization.",
        "content": "A practical advantage to our all-transistor RNG is that detailedandprovenfoundry-providedmodelscanbeused to study the effect of manufacturing variations on our",
        "attributes": [
          {
            "name": "design_advantage",
            "value": "manufacturing variability analysis"
          },
          {
            "name": "model_availability",
            "value": "foundry-provided models"
          },
          {
            "name": "optimization_approach",
            "value": "systematic design"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The sampling procedure uses block sampling of bipartite Boltzmann machines, where each color block can be sampled in parallel for K iterations (K≈1000, longer than mixing time) to draw samples from Eq. (7).",
        "content": "Due to our chosen connectivity patterns, our Boltz-mann machines are bipartite (two-colorable). Since each color block can be sampled in parallel, a single itera- tion of Gibbs sampling corresponds to sampling the first colorblockconditionedonthesecondandthenviceversa. Starting from some random initialization, this block sampling procedure could then be repeated forKiterations (whereKis longer than the mixing time of the sampler, typicallyK≈1000) to draw samples from Eq. (7) for each step in the approximation to the reverse process.",
        "attributes": [
          {
            "name": "sampling_method",
            "value": "block sampling"
          },
          {
            "name": "parallelization",
            "value": "bipartite color blocks"
          },
          {
            "name": "iterations",
            "value": "K≈1000"
          },
          {
            "name": "reference",
            "value": "Eq. (7)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTCA implementation uses sparse and locally connected EBMs that enable massively parallel Gibbs sampling via primitive circuitry",
        "content": "Practical implementations of the DTCA utilize natural-to-implement EBMs that exhibit sparse and local connectivity, as is typical in the literature [33]. This constraint allows sampling of the EBM to be performed by massively parallel arrays of primitive circuitry that implement Gibbs sampling.",
        "attributes": [
          {
            "name": "source",
            "value": "Literature [33]"
          },
          {
            "name": "implementation_type",
            "value": "Hardware circuitry"
          },
          {
            "name": "sampling_method",
            "value": "Gibbs sampling"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Ef t−1 can be implemented efficiently using constrained EBMs for both continuous and discrete diffusion",
        "content": "A key feature of the DTCA is thatEf t−1 can be implemented efficiently using our constrained EBMs. Specifically, for both continuous and discrete diffusion,E f t−1 can be implemented using a single pairwise interaction between corresponding variables inxt andx t−1.",
        "attributes": [
          {
            "name": "feature",
            "value": "DTCA efficiency"
          },
          {
            "name": "diffusion_types",
            "value": "Continuous and discrete"
          },
          {
            "name": "implementation",
            "value": "Pairwise interaction"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "EBMs can be scaled arbitrarily by combining hardware latent-variable EBMs into software-defined graphical models",
        "content": "At the lowest level, this corresponds to high-dimensional, regularly structured latent variable EBM. If more powerful models are desired, these hardware latent-variable EBMs can be arbitrarily scaled by combining them into software-defined graphical models.",
        "attributes": [
          {
            "name": "scalability",
            "value": "Arbitrary scaling"
          },
          {
            "name": "model_type",
            "value": "Graphical models"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTM hardware implementations offer modular design options including distinct circuitry per EBM, split across chips, or reprogrammable hardware",
        "content": "The modular nature of DTMs enables various hardware implementations. For example, each EBM in the chain can be implemented using distinct physical circuitry on the same chip, as shown in Fig. 3 (b). Alternatively, the various EBMs may be split across several communicating chips or implemented by the same hardware, reprogrammed with distinct sets of weights at different times.",
        "attributes": [
          {
            "name": "modularity",
            "value": "Various implementations"
          },
          {
            "name": "reference",
            "value": "Fig. 3 (b)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "GPU simulator developed for DTCA achieved FID-validated performance on Fashion-MNIST dataset with energy efficiency compared to VAE on GPU",
        "content": "To understand the performance of a future hardware device, we developed a GPU simulator of the DTCA and used it to train a DTM on the Fashion-MNIST dataset. We measure the performance of the DTM using FID and utilize a physical model to estimate the energy required to generate new images. These numbers can be compared to conventional algorithm/hardware pairings, such as a VAE running on a GPU; these results are shown in Fig. 1.",
        "attributes": [
          {
            "name": "dataset",
            "value": "Fashion-MNIST"
          },
          {
            "name": "performance_metric",
            "value": "FID"
          },
          {
            "name": "reference",
            "value": "Fig. 1"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Boltzmann machines serve as hardware-efficient EBMs due to simple Gibbs sampling update rules",
        "content": "Boltzmann machines are hardware efficient because the Gibbs sampling update rule required to sample from them is simple. Boltzmann machines implement energy functions of the form E(x) =−β⟨∑i̸=j xiJijxj + ∑i=1 hixi⟩,(10), where eachx i ∈ {−1,1}.",
        "attributes": [
          {
            "name": "efficiency_reason",
            "value": "Simple Gibbs sampling"
          },
          {
            "name": "variable_type",
            "value": "Binary"
          },
          {
            "name": "reference",
            "value": "Eq. (10)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Boltzmann machine hardware implementation uses regular grid of Bernoulli sampling circuits with sigmoidal bias control",
        "content": "Implementing our proposed hardware architecture using Boltzmann machines is particularly simple. A device will consist of a regular grid of Bernoulli sampling circuits, where each sampling circuit implements the Gibbs sampling update for a single variablex i. The bias of the sampling circuits (probability that it produces 1 as opposed to−1) is constrained to be a sigmoidal function of an input voltage, allowing the conditional update given in Eq. (11) to be implemented using a simple circuit that adds currents such as a resistor network.",
        "attributes": [
          {
            "name": "circuit_type",
            "value": "Bernoulli sampling circuits"
          },
          {
            "name": "bias_control",
            "value": "Sigmoidal function"
          },
          {
            "name": "reference",
            "value": "Eq. (11)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Boltzmann machines use 70×70 grids with 12 neighbor connections in bipartite configuration",
        "content": "Due to our chosen connectivity patterns, our Boltzmann machines are bipartite (two-colorable). Since each color block can be sampled in parallel, a single iteration of Gibbs sampling corresponds to sampling the first colorblockconditiononthesecondandthenviceversa. Starting from some random initialization, this block sampling procedure could then be repeated forKiterations (whereKis longer than the mixing time of the sampler, typicallyK≈1000) to draw samples from Eq. (7) for each step in the approximation to the reverse process.",
        "attributes": [
          {
            "name": "grid_size",
            "value": "70×70"
          },
          {
            "name": "connections",
            "value": "12 neighbors"
          },
          {
            "name": "iterations",
            "value": "K≈1000"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Subthreshold transistor shot-noise dynamics enabled fast, energy-efficient programmable RNG with 100ns autocorrelation decay",
        "content": "To enable a near-term, large-scale realization of the DTCA, we leveraged the shot-noise dynamics of subthreshold transistors [45] to build an RNG that is fast, energy-efficient, and small. Our all-transistor RNG is programmable and has the desired sigmoidal response to a control voltage, as shown by experimental measurements in Fig. 4 (a). The stochastic voltage signal output from the RNG has an approximately exponential autocorrelation function that decays in around100ns, as illustrated in Fig. 4 (b).",
        "attributes": [
          {
            "name": "technology",
            "value": "Subthreshold transistors [45]"
          },
          {
            "name": "performance",
            "value": "100ns decay"
          },
          {
            "name": "reference",
            "value": "Fig. 4 (a), (b)"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "DTCA hardware efficiency is a key advantage due to unconstrained Eθ t−1 allowing selection of hardware-optimized EBMs",
        "content": "Critically, Eq. (8) places no constraints on the form ofE θ t−1. Therefore, we are free to use EBMs that our hardware implements especially efficiently.",
        "attributes": [
          {
            "name": "advantage",
            "value": "Hardware efficiency"
          },
          {
            "name": "constraint",
            "value": "None on Eθ t−1"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Near-term DTCA realization requires leveraging transistor shot-noise dynamics for practical RNG implementation",
        "content": "To enable a near-term, large-scale realization of the DTCA, we leveraged the shot-noise dynamics of subthreshold transistors [45] to build an RNG that is fast, energy-efficient, and small.",
        "attributes": [
          {
            "name": "strategy",
            "value": "Near-term realization"
          },
          {
            "name": "component",
            "value": "RNG implementation"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "DTCA architecture references multiple appendices (B, C, A.1, C.1, D.1, C, J) for theoretical discussion and implementation details",
        "content": "Refer to Appendices B and C for a further theoretical discussion of the hardware architecture. See Appendix D.1. Appendix C provides further details on the Boltzmann machine architecture. Appendix J provides further details about our RNG.",
        "attributes": [
          {
            "name": "documentation",
            "value": "Multiple appendices"
          },
          {
            "name": "topics",
            "value": "Theoretical discussion, architecture details"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "DTCA performance evaluation includes FID metrics and energy consumption comparisons against VAE baseline",
        "content": "We measure the performance of the DTM using FID and utilize a physical model to estimate the energy required to generate new images. These numbers can be compared to conventional algorithm/hardware pairings, such as a VAE running on a GPU; these results are shown in Fig. 1.",
        "attributes": [
          {
            "name": "evaluation_metrics",
            "value": "FID, Energy consumption"
          },
          {
            "name": "baseline",
            "value": "VAE on GPU"
          },
          {
            "name": "reference",
            "value": "Fig. 1"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The document describes a programmable random number generator (RNG) with operating characteristics that can be controlled by varying an input voltage, with the probability of high state output following a sigmoid function relationship.",
        "content": "FIG. 4.A programmable source of random bits. (a)A laboratory measurement of the operating characteristic of our RNG. The probability of the output voltage signal being in the high state (x= 1) can be programmed by varying an input voltage. The relationship betweenP(x= 1)and the input voltage is well-approximated by a sigmoid function.",
        "attributes": [
          {
            "name": "source",
            "value": "Fig. 4(a)"
          },
          {
            "name": "component",
            "value": "RNG operating characteristics"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The RNG's autocorrelation function shows exponential decay with a time constant τ0 ≈ 100ns at the unbiased point (P(x=1) = 0.5), indicating good random behavior.",
        "content": "(b)The autocorrelation function of the RNG at the unbiased point (P(x= 1) = 0.5). The decay is approximately exponential with the rateτ0 ≈100ns.",
        "attributes": [
          {
            "name": "source",
            "value": "Fig. 4(b)"
          },
          {
            "name": "measurement",
            "value": "autocorrelation decay"
          },
          {
            "name": "value",
            "value": "τ0 ≈ 100ns"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Manufacturing variation studies show that the RNG works reliably across different process corners, with the 'slow NMOS, fast PMOS' case being the worst performer due to design asymmetry.",
        "content": "(c)Estimating the effect of manufacturing variation on RNG performance. Each point in the plot represents the results of a simulation of an RNG circuit with transistor parameters sampled according to a procedure defined by the manufacturer's PDK. Each color represents a different process corner, each for which∼200realizations of the RNG were simulated. The \"typical\" corner represents a balanced case, whereas the other two are asymmetric corners where the two types of transistors (NMOS and PMOS) are skewed in opposite directions. The slow NMOS and fast PMOS case is worst performing for us due to an asymmetry in our design.",
        "attributes": [
          {
            "name": "source",
            "value": "Fig. 4(c)"
          },
          {
            "name": "analysis_type",
            "value": "manufacturing variation"
          },
          {
            "name": "finding",
            "value": "Reliable across process corners"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The energy consumption of the probabilistic computer is modeled using a physical model of an all-transistor Boltzmann machine Gibbs sampler, with energy contributions from RNG, bias, clock, and communication components.",
        "content": "The energy estimates given in Fig. 1 for the probabilistic computer were constructed using a physical model of an all-transistor Boltzmann machine Gibbs sampler. The dominant contributions to this model are captured by the formula E=T KmixL2Ecell,(12) Ecell =E rng +E bias +E clock +E comm,(13) whereE rng comes from the data in Fig. 4 (c).",
        "attributes": [
          {
            "name": "model_type",
            "value": "Boltzmann machine Gibbs sampler"
          },
          {
            "name": "components",
            "value": "RNG, bias, clock, communication"
          },
          {
            "name": "equation",
            "value": "Ecell = E_rng + E_bias + E_clock + E_comm"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The estimated cell energy consumption Ecell ≈ 2fJ is derived from a physical model using the same transistor process as the RNG with reasonable parameter selections.",
        "content": "Generally, given the same transistor process we used for our RNG and some reasonable selections for other free parameters of the model, we can estimate Ecell ≈2fJ. See Appendix D for an exhaustive derivation of this model.",
        "attributes": [
          {
            "name": "value",
            "value": "Ecell ≈ 2fJ"
          },
          {
            "name": "derivation",
            "value": "Physical model"
          },
          {
            "name": "source",
            "value": "Appendix D"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Energy consumption for GPU implementation is estimated by computing total floating-point operations (FLOPs) required and dividing by manufacturer's FLOP/joule specification.",
        "content": "We use a simple model for the energy consumption of the GPU that underestimates the actual values. We compute the total number of floating-point operations (FLOPs) required to generate a sample from the trained model and divide that by the FLOP/joule specification given by the manufacturer.",
        "attributes": [
          {
            "name": "method",
            "value": "FLOPs / manufacturer specification"
          },
          {
            "name": "component",
            "value": "GPU energy estimation"
          },
          {
            "name": "reference",
            "value": "Appendix E"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Energy-based models (EBMs) are trained using Monte-Carlo estimators for gradient computation, with terms computed independently for each time step t.",
        "content": "The EBMs used in the experiments presented in Fig. 1 were trained by applying the standard Monte-Carlo estimator for the gradients of EBMs [61] to Eq. (4), which yields ∇θLDN (θ)= TX t=1 EQ(xt−1,xt) [EPθ(zt−1|xt−1,xt) [∇θEm t−1 ] −EPθ(xt−1,zt−1|xt) [∇θEm t−1 ] ] . (14) Notably, each term in the sum overtcan be computed independently.",
        "attributes": [
          {
            "name": "method",
            "value": "Monte-Carlo gradient estimation"
          },
          {
            "name": "equation",
            "value": "Eq. (14)"
          },
          {
            "name": "characteristic",
            "value": "Independent term computation"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTMs allow EBMs to have finite and short mixing times, enabling nearly unbiased gradient estimates through sufficient sampling iterations, unlike MEBMs which typically have long mixing times making unbiased gradient estimates impossible in most cases.",
        "content": "It should be noted that the DTCA allows our EBMs to have finite and short mixing times, which enables sufficient sampling iterations to be used to achieve nearly unbiased estimates of the gradient. Unbiased gradient estimates are not possible for MEBMs in most cases due to their long mixing times [62].",
        "attributes": [
          {
            "name": "source",
            "value": "text"
          },
          {
            "name": "reference",
            "value": "[62]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTMs significantly improve training stability compared to MEBMs, and when complemented with ACP (Annealed Control Process), completely stabilize the training process.",
        "content": "DTMs alleviate the training instability that is fundamental to MEBMs... An example of the training dynamics for several different types of models is shown in Fig. 5 (b)... Complementing DTMs with the ACP completely stabilizes training.",
        "attributes": [
          {
            "name": "source",
            "value": "text"
          },
          {
            "name": "reference",
            "value": "Fig. 5(b)"
          },
          {
            "name": "method",
            "value": "DTM + ACP"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "MEBM training becomes unstable as the model becomes complex and multimodal during training, causing samples to deviate from equilibrium and gradients to lose meaningful direction.",
        "content": "However, as these gradients are followed, the MEBM is reshaped according to the data distribution and begins to become complex and multimodal. This induced multimodality greatly increases the sampling complexity of the distribution, causing samples to deviate from equilibrium. Gradients computed using non-equilibrium samples do not necessarily point in a meaningful direction, which can halt or, in some cases, even reverse the training process.",
        "attributes": [
          {
            "name": "source",
            "value": "text"
          },
          {
            "name": "problem",
            "value": "training instability"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Training stability can be measured using normalized autocorrelation (ryy[k]) where values close to 1 indicate far-from-equilibrium samples and low-quality gradients, while values close to 0 indicate samples near equilibrium and high-quality gradient estimates.",
        "content": "The lower plot in Fig. 5 (b) shows the autocorrelation at a delay equal to the total number of sampling iterations used to estimate the gradients during training. Generally, if r_yy is close to 1, gradients were estimated using far-from-equilibrium samples and were likely of low quality. If it is close to zero, the samples should be close to equilibrium and produce high-quality gradient estimates.",
        "attributes": [
          {
            "name": "source",
            "value": "text"
          },
          {
            "name": "metric",
            "value": "normalized autocorrelation r_yy[k]"
          },
          {
            "name": "reference",
            "value": "Eq. (15), (16)"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Denoising models stabilize training by implementing simpler transformations per layer, reducing the complexity of the distribution the model must learn and making it easier to sample from.",
        "content": "Denoising alone significantly stabilizes training. Because the transformation carried out by each layer is simpler, the distribution that the model must learn is less complex and, therefore, easier to sample from.",
        "attributes": [
          {
            "name": "source",
            "value": "text"
          },
          {
            "name": "method",
            "value": "denoising"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "EBM performance scales with complexity - layers with more connectivity and longer allowed mixing times can utilize more latent variables and achieve higher performance, as demonstrated by varying grid size L in Fashion-MNIST experiments.",
        "content": "The effect of scaling EBM complexity on DTM performance. The grid size L was modified to change the number of latent variables compared to the (fixed) number of data variables. Generally, EBM layers with more connectivity and longer allowed mixing times can utilize more latent variables and, therefore, achieve higher performance.",
        "attributes": [
          {
            "name": "source",
            "value": "text"
          },
          {
            "name": "reference",
            "value": "Fig. 5(c)"
          },
          {
            "name": "dataset",
            "value": "Fashion-MNIST"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "DTM training becomes unstable due to complex energy landscape development among latent variables",
        "content": "As training progresses, the DTM eventually becomes unstable, which can be attributed to the development of a complex energy landscape among the latent variables.",
        "attributes": [
          {
            "name": "section",
            "value": "Training Stability"
          },
          {
            "name": "confidence",
            "value": "high"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Total correlation penalty added to loss function to penalize poorly mixing models",
        "content": "We add a term to the loss function that nudges the optimization towards a distribution that is easy to sample from",
        "attributes": [
          {
            "name": "section",
            "value": "Training Procedure"
          },
          {
            "name": "method",
            "value": "Total Correlation Penalty"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Total loss function combines DN loss and total correlation penalty",
        "content": "The total loss function is the sum of Eq. (4) and this total correlation penalty: L=L DN + Σ_{t=1}^{T} λtLTC t",
        "attributes": [
          {
            "name": "section",
            "value": "Loss Function"
          },
          {
            "name": "equation",
            "value": "(18)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Adaptive Correlation Penalty (ACP) provides closed-loop control of correlation penalty strengths",
        "content": "We use an Adaptive Correlation Penalty (ACP) to set the λt as large as necessary to keep sampling tractable for each layer",
        "attributes": [
          {
            "name": "section",
            "value": "Adaptive Control"
          },
          {
            "name": "method",
            "value": "ACP"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Increasing DTM depth from 2 to 8 substantially improves image generation quality",
        "content": "As shown in Fig. 1, increasing the depth of the DTM from 2 to 8 substantially improves the quality of generated images",
        "attributes": [
          {
            "name": "section",
            "value": "Scaling Analysis"
          },
          {
            "name": "figure",
            "value": "Fig. 1"
          },
          {
            "name": "improvement",
            "value": "substantial"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Larger values of K are required to support wider models with constant connectivity",
        "content": "which demonstrates that larger values of K are required to support wider models holding connectivity constant",
        "attributes": [
          {
            "name": "section",
            "value": "Scaling Constraints"
          },
          {
            "name": "parameter",
            "value": "K"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Probabilistic ML hardware should be scaled as part of hybrid systems rather than in isolation",
        "content": "we hypothesize that the correct way to scale probabilistic machine learning hardware systems is not in isolation but rather as a component in a larger hybrid thermodynamic-deterministic machine learning (HTDML) system",
        "attributes": [
          {
            "name": "section",
            "value": "Conclusion"
          },
          {
            "name": "hypothesis",
            "value": "HTDML scaling approach"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Hardware-efficient EBM topology cannot be scaled in isolation to model arbitrarily complex datasets",
        "content": "It would be naive to expect that a hardware-efficient EBM topology can be scaled in isolation to model arbitrarily complex datasets",
        "attributes": [
          {
            "name": "section",
            "value": "Scaling Limitations"
          },
          {
            "name": "confidence",
            "value": "qualified"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Deterministic processors are sometimes better tools for specific ML tasks",
        "content": "A hybrid approach is sensible because there is no a priori reason to believe that a probabilistic computer should handle every part of a machine learning problem, and sometimes a deterministic processor is likely a better tool for the job",
        "attributes": [
          {
            "name": "section",
            "value": "Hybrid Approach Rationale"
          },
          {
            "name": "rationale",
            "value": "task-specific suitability"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Training dynamics show monotonic quality improvement with small autocorrelations under closed-loop control",
        "content": "Model quality increases monotonically, and the autocorrelation stays small throughout training. This closed-loop control of the correlation penalty was employed during the training of most models used to produce the results in this article",
        "attributes": [
          {
            "name": "section",
            "value": "Training Dynamics"
          },
          {
            "name": "policy",
            "value": "closed-loop control"
          },
          {
            "name": "figure",
            "value": "Fig. 5 (b)"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "HTDML energy landscape decomposes into deterministic and probabilistic components",
        "content": "Mathematically, the landscape of HTDML may be summarized as Etot(S, D, p) =Edet(S, D, p) +Eprob(S, D, p)",
        "attributes": [
          {
            "name": "section",
            "value": "HTDML Formulation"
          },
          {
            "name": "equation",
            "value": "(19)"
          },
          {
            "name": "components",
            "value": "deterministic + probabilistic"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "A DTM trained to generate CIFAR-10 images achieves performance parity with a traditional GAN using approximately 10× smaller deterministic neural network",
        "content": "The DTM is trained to generate CIFAR-10 images and achieves performance parity with a traditional GAN using a∼10×smaller deterministic neural network.",
        "attributes": [
          {
            "name": "source",
            "value": "Figure 6 description"
          },
          {
            "name": "dataset",
            "value": "CIFAR-10"
          },
          {
            "name": "comparison",
            "value": "DTM vs traditional GAN"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Binarization is not viable as a general approach for embedding data into hardware EBMs",
        "content": "Indeed, binarization is not viable in general, and embedding into richer types of variables (such as categorical) at the probabilistic hardware level is not particularly efficient or principled.",
        "attributes": [
          {
            "name": "source",
            "value": "Technical analysis section"
          },
          {
            "name": "method",
            "value": "binarization critique"
          },
          {
            "name": "scope",
            "value": "general applicability"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Current embedding methods using autoencoders and DTMs are not jointly trained, which may result in suboptimal embedding due to DTM's limited connectivity",
        "content": "One major flaw with our method is that the autoencoder and DTM are not jointly trained, which means that the embedding learned by the autoencoder may not be well-suited to the way information can flow in the DTM, given its limited connectivity.",
        "attributes": [
          {
            "name": "source",
            "value": "Analysis of embedding method"
          },
          {
            "name": "limitation",
            "value": "joint training not implemented"
          },
          {
            "name": "hardware_constraint",
            "value": "DTM limited connectivity"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "A 6×6µm chip could potentially fit approximately 10^6 sampling cells based on current RNG size estimates",
        "content": "Based on the size of our RNG, it can be estimated that∼10 6 sampling cells could be fit into a6×6µm chip (see Appendix J).",
        "attributes": [
          {
            "name": "source",
            "value": "Scalability analysis"
          },
          {
            "name": "chip_size",
            "value": "6×6µm"
          },
          {
            "name": "capacity",
            "value": "~10^6 sampling cells"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "There exists a significant gap between current model sizes and potential hardware capabilities, with the largest DTM using only around 50,000 cells compared to potential 10^6 capacity",
        "content": "In contrast, the largest DTM shown in Fig. 1 would use only around 50,000 cells.",
        "attributes": [
          {
            "name": "source",
            "value": "Scalability comparison"
          },
          {
            "name": "current_model_size",
            "value": "~50,000 cells"
          },
          {
            "name": "potential_capacity",
            "value": "~10^6 cells"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Optimal solutions in HTDML will likely be found between deterministic and probabilistic extremes where subsystem contributions are nearly balanced",
        "content": "Like many engineered systems, optimal solutions will be found somewhere in the middle, where the contributions from the various subsystems are nearly balanced [65–67].",
        "attributes": [
          {
            "name": "source",
            "value": "System design philosophy"
          },
          {
            "name": "approach",
            "value": "balanced subsystem design"
          },
          {
            "name": "reference",
            "value": "[65-67]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "GPU simulation of hardware EBMs is inefficient due to sparse data structures not matching regular tensor data types",
        "content": "One difficulty with HTDML research is that simulating large hardware EBMs on GPUs can be a challenging task. GPUs run these EBMs much less efficiently than probabilistic computers and the sparse data structures that naturally arise when working with hardware EBMs do not mesh well with regular tensor data types.",
        "attributes": [
          {
            "name": "source",
            "value": "Research challenges"
          },
          {
            "name": "platform",
            "value": "GPU simulation"
          },
          {
            "name": "issue",
            "value": "sparse data structures vs regular tensors"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "A JAX-based software library with XLA acceleration has been developed for simulating hardware EBMs",
        "content": "We have both short and long-term solutions to these challenges. To address these challenges in the short term, we have open-sourced a software library [69] that enables XLA-accelerated [70] simulation of hardware EBMs. This library is written in JAX [71] and automates the complex slicing operations that enable hardware EBM sampling.",
        "attributes": [
          {
            "name": "source",
            "value": "Software solution"
          },
          {
            "name": "technology",
            "value": "JAX with XLA acceleration"
          },
          {
            "name": "availability",
            "value": "open-sourced"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Page 10 contains a comprehensive bibliography section with 40 references spanning 2006-2025, covering topics in AI, machine learning, quantum computing, and related scientific fields.",
        "content": "10\n[1] A. A. Chien, Commun. ACM66, 5 (2023).\n[2] D. D. Stine,The Manhattan Project, the Apollo Program,\nand Federal Energy Technology R&D Programs: A Com-\nparative Analysis, Report RL34645 (Congressional Re-\nsearch Service, Washington, D.C., 2009).\n[3] J. Aljbour, T. Wilson, and P. Patel, EPRI White Paper\nno. 3002028905 (2024).\n[4] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser,\nR. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D.\nLago, T. Hubert, P. Choy, C. de Masson d'Autume,\nI. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal,\nA. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Rob-\nson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and\nO. Vinyals, Science378, 1092 (2022).\n[5] D. M. Katz, M. J. Bommarito, S. Gao, and P. Arredondo,\nPhilos. Trans. R. Soc. A382, 20230254 (2024).\n[6] H. Nori, N. King, S. M. McKinney, D. Carignan, and\nE. Horvitz, arXiv [cs.CL] (2023).\n[7] S. Noy and W. Zhang, Science381, 187 (2023).\n[8] E. Brynjolfsson, D. Li, and L. Raymond, Q. J. Econ.\n10.1093/qje/qjae044 (2025).\n[9] S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer,\narXiv [cs.SE] (2023).\n[10] A. Bick, A. Blandin, and D. J. Deming, The rapid adop-\ntion of generative ai, Tech. Rep. (National Bureau of Eco-\nnomic Research, 2024).\n[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,\nL. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin,\ninAdvances in Neural Information Processing Systems,\nVol. 30, edited by I. Guyon, U. V. Luxburg, S. Bengio,\nH. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett\n(Curran Associates, Inc., 2017).\n[12] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng,\nand B. Catanzaro, inProceedings of the 30th Interna-\ntional Conference on International Conference on Ma-\nchine Learning - Volume 28, ICML'13 (JMLR.org, 2013)\np. III–1337–III–1345.\n[13] K. Chellapilla, S. Puri, and P. Simard, inTenth Inter-\nnational Workshop on Frontiers in Handwriting Recogni-\ntion, edited by G. Lorette, Université de Rennes 1 (Su-\nvisoft, La Baule (France), 2006).\n[14] H. Xiao, K. Rasul, and R. Vollgraf, arXiv [cs.LG] (2017).\n[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler,\nand S. Hochreiter, inAdvances in Neural Information Pro-\ncessing Systems, Vol. 30, edited by I. Guyon, U. V.\nLuxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-\nwanathan, and R. Garnett (Curran Associates, Inc.,\n2017).\n[16] D.P.KingmaandM.Welling,Auto-Encoding Variational\nBayes(2022).\n[17] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,\nD. Warde-Farley, S. Ozair, A. Courville, and Y. Ben-\nbio, inAdvances in Neural Information Processing Sys-\ntems, Vol. 27, edited by Z. Ghahramani, M. Welling,\nC. Cortes, N. Lawrence, and K. Weinberger (Curran As-\nsociates, Inc., 2014).\n[18] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and\nS. Ganguli, inProceedings of the 32nd International Con-\nference on Machine Learning, Proceedings of Machine\nLearning Research, Vol. 37, edited by F. Bach and D. Blei\n(PMLR, Lille, France, 2015) pp. 2256–2265.\n[19] S. Hooker, Commun. ACM64, 58–65 (2021).\n[20] S. Ambrogio, P. Narayanan, A. Okazaki, A. Fasoli,\nC. Mackin, K. Hosokawa, A. Nomura, T. Yasuda,\nA. Chen, A. Friz,et al., Nature620, 768 (2023).\n[21] S. Bandyopadhyay, A. Sludds, S. Krastanov, R. Hamerly,\nN. Harris, D. Bunandar, M. Streshinsky, M. Hochberg,\nand D. Englund, Nat. Photon.18, 1335 (2024).\n[22] H. A. Gonzalez, J. Huang, F. Kelber, K. K. Nazeer,\nT.Langer, C.Liu, M.Lohrmann, A.Rostami, M.Schone,\nB. Vogginger,et al., arXiv [cs.ET] (2024).\n[23] S. B. Shrestha, J. Timcheck, P. Frady, L. Campos-\nMacias, and M. Davies, inICASSP 2024 - 2024 IEEE\nInternational Conference on Acoustics, Speech and Sig-\nnal Processing (ICASSP)(2024) pp. 13481–13485.\n[24] Y. Sun, N. B. Agostini, S. Dong, and D. Kaeli, arXiv\n[cs.DC] (2019).\n[25] Y. Song and S. Ermon, inAdvances in Neural Informa-\ntion Processing Systems, Vol. 32, edited by H. Wallach,\nH. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,\nand R. Garnett (Curran Associates, Inc., 2019).\n[26] M. Janner, Y. Du, J. Tenenbaum, and S. Levine, inInter-\nnational Conference on Machine Learning(PMLR, 2022)\npp. 9902–9915.\n[27] N. S. Singh, K. Kobayashi, Q. Cao, K. Selcuk, T. Hu,\nS. Niazi, N. A. Aadit, S. Kanai, H. Ohno, S. Fukami,\net al., Nat. Commun.15, 2685 (2024).\n[28] C. Pratt, K. Ray, and J. Crutchfield,Dynamical Com-\nputing on the Nanoscale: Superconducting Circuits for\nThermodynamically-Efficient Classical Information Pro-\ncessing(2023).\n[29] G. Wimsatt, O.-P. Saira, A. B. Boyd, M. H. Matheny,\nS. Han, M. L. Roukes, and J. P. Crutchfield, Phys. Rev.\nRes.3, 033115 (2021).\n[30] S. H. Adachi and M. P. Henderson,Application of Quan-\ntum Annealing to Training of Deep Neural Networks\n(2015).\n[31] B. Sutton, K. Y. Camsari, B. Behin-Aein, and S. Datta,\nSci. Rep.7, 44370 (2017).\n[32] R. Faria, K. Y. Camsari, and S. Datta, IEEE Magn. Lett.\n8, 1 (2017).\n[33] S.Niazi, S.Chowdhury, N.A.Aadit, M.Mohseni, Y.Qin,\nand K. Y. Camsari, Nat. Electron.7, 610 (2024).\n[34] W. A. Borders, A. Z. Pervaiz, S. Fukami, K. Y. Camsari,\nH. Ohno, and S. Datta, Nature573, 390 (2019).\n[35] N. S. Singh, K. Kobayashi, Q. Cao, K. Selcuk, T. Hu,\nS. Niazi, N. A. Aadit, S. Kanai, H. Ohno, S. Fukami,\net al., Nat. Commun.15, 2685 (2024).\n[36] M. M. H. Sajeeb, N. A. Aadit, S. Chowdhury, T. Wu,\nC. Smith, D. Chinmay, A. Raut, K. Y. Camsari, C. Dela-\ncour, and T. Srimani, Phys. Rev. Appl.24, 014005\n(2025).\n[37] T. Conte, E. DeBenedictis, N. Ganesh, T. Hylton,\nJ. P. Strachan, R. S. Williams, A. Alemi, L. Altenberg,\nG. Crooks, J. Crutchfield,et al., arXiv [cs.CY] (2019).\n[38] Y. Du and I. Mordatch, inAdvances in Neural Informa-\ntion Processing Systems, Vol. 32, edited by H. Wallach,\nH. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox,\nand R. Garnett (Curran Associates, Inc., 2019).\n[39] W. Lee, H. Kim, H. Jung, Y. Choi, J. Jeon, and C. Kim,\nSci. Rep.15, 8018 (2025).\n[40] M. Horodynski, C. Roques-Carmes, Y. Salamin, S. Choi,",
        "attributes": [
          {
            "name": "page_number",
            "value": "10"
          },
          {
            "name": "section",
            "value": "References/Bibliography"
          },
          {
            "name": "total_entries",
            "value": "40"
          },
          {
            "name": "date_range",
            "value": "2006-2025"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The bibliography includes recent cutting-edge AI research from 2022-2025, including generative AI adoption studies, neural network research, and quantum computing applications.",
        "content": "[4] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser,\nR. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D.\nLago, T. Hubert, P. Choy, C. de Masson d'Autume,\nI. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal,\nA. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Rob-\nson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and\nO. Vinyals, Science378, 1092 (2022).\n[5] D. M. Katz, M. J. Bommarito, S. Gao, and P. Arredondo,\nPhilos. Trans. R. Soc. A382, 20230254 (2024).\n[6] H. Nori, N. King, S. M. McKinney, D. Carignan, and\nE. Horvitz, arXiv [cs.CL] (2023).\n[7] S. Noy and W. Zhang, Science381, 187 (2023).\n[8] E. Brynjolfsson, D. Li, and L. Raymond, Q. J. Econ.\n10.1093/qje/qjae044 (2025).\n[10] A. Bick, A. Blandin, and D. J. Deming, The rapid adop-\ntion of generative ai, Tech. Rep. (National Bureau of Eco-\nnomic Research, 2024).",
        "attributes": [
          {
            "name": "page_number",
            "value": "10"
          },
          {
            "name": "focus_area",
            "value": "Recent AI Research (2022-2025)"
          },
          {
            "name": "publication_types",
            "value": "Science, arXiv, Economic Research"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The references show a strong focus on neural network foundations and deep learning research, including classic papers on GANs, transformers, and variational autoencoders.",
        "content": "[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,\nL. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin,\ninAdvances in Neural Information Processing Systems,\nVol. 30, edited by I. Guyon, U. V. Luxburg, S. Bengio,\nH. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett\n(Curran Associates, Inc., 2017).\n[14] H. Xiao, K. Rasul, and R. Vollgraf, arXiv [cs.LG] (2017).\n[16] D.P.KingmaandM.Welling,Auto-Encoding Variational\nBayes(2022).\n[17] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,\nD. Warde-Farley, S. Ozair, A. Courville, and Y. Ben-\nbio, inAdvances in Neural Information Processing Sys-\ntems, Vol. 27, edited by Z. Ghahramani, M. Welling,\nC. Cortes, N. Lawrence, and K. Weinberger (Curran As-\nsociates, Inc., 2014).\n[15] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler,\nand S. Hochreiter, inAdvances in Neural Information Pro-\ncessing Systems, Vol. 30, edited by I. Guyon, U. V.\nLuxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-\nwanathan, and R. Garnett (Curran Associates, Inc.,\n2017).",
        "attributes": [
          {
            "name": "page_number",
            "value": "10"
          },
          {
            "name": "topic_area",
            "value": "Neural Network Foundations"
          },
          {
            "name": "key_papers",
            "value": "Transformers, GANs, VAEs, Deep Learning"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The bibliography includes significant quantum computing and neuromorphic research, with multiple references to applications in AI training and information processing.",
        "content": "[28] C. Pratt, K. Ray, and J. Crutchfield,Dynamical Com-\nputing on the Nanoscale: Superconducting Circuits for\nThermodynamically-Efficient Classical Information Pro-\ncessing(2023).\n[29] G. Wimsatt, O.-P. Saira, A. B. Boyd, M. H. Matheny,\nS. Han, M. L. Roukes, and J. P. Crutchfield, Phys. Rev.\nRes.3, 033115 (2021).\n[30] S. H. Adachi and M. P. Henderson,Application of Quan-\ntum Annealing to Training of Deep Neural Networks\n(2015).\n[31] B. Sutton, K. Y. Camsari, B. Behin-Aein, and S. Datta,\nSci. Rep.7, 44370 (2017).\n[32] R. Faria, K. Y. Camsari, and S. Datta, IEEE Magn. Lett.\n8, 1 (2017).\n[33] S.Niazi, S.Cowdhury, N.A.Aadit, M.Mohseni, Y.Qin,\nand K. Y. Camsari, Nat. Electron.7, 610 (2024).\n[34] W. A. Borders, A. Z. Pervaiz, S. Fukami, K. Y. Camsari,\nH. Ohno, and S. Datta, Nature573, 390 (2019).",
        "attributes": [
          {
            "name": "page_number",
            "value": "10"
          },
          {
            "name": "research_area",
            "value": "Quantum Computing & Neuromorphic Computing"
          },
          {
            "name": "applications",
            "value": "Neural Network Training, Information Processing"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Denoising diffusion models learn to time-reverse a random process that converts data into simple noise",
        "content": "Denoising diffusion models try to learn to time-reverse a random process that converts data into simple noise. Here, we will review some details on how these models work to support the analysis in the main text.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Introduction"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The forward process converts data distribution into noise through stochastic differential equations (continuous case) or Markov jump processes (discrete case)",
        "content": "The forward process is a random process that is used to convert the data distribution into noise. This conversion into noise is achieved through a stochastic differential equation in the continuous-variable case and a Markov jump process in the discrete case.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Forward Processes"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "In the continuous case, the typical forward process is Itô diffusion with the specific equation dX(t) = −X(t)dt + √2σdW",
        "content": "In the continuous case, the typical choice of forward process is the Itô diffusion, dX(t) =−X(t)dt+ √2σdW where X(t) is a length N vector representing the state variable at time t, σ is a constant, and dW is a length N vector of independent Wiener processes.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Continuous Variables"
          },
          {
            "name": "equation",
            "value": "A1"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The transition kernel defines how probability distribution evolves over time, given by Qt|0(x′|x) = P(X(t) =x ′|X(0) =x)",
        "content": "The transition kernel for a random process defines how the probability distribution evolves in time, Qt|0(x′|x) =P(X(t) =x ′|X(0) =x)",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Continuous Variables"
          },
          {
            "name": "equation",
            "value": "A2"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For the Itô diffusion case, the transition kernel follows a Gaussian distribution with mean µ=e^(-tx) and covariance matrix Σ=σ²I(1−e^(-2t))",
        "content": "For the case of Eq. (A1) the transition kernel is, Qt+s|s(x′|x)∝e^−1/2 (x′−µ)^T Σ^−1(x′−µ) µ=e^−tx Σ =σ²I(1−e^−2t) this solution can be verified by direct substitution into the corresponding Fokker-Planck equation.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Continuous Variables"
          },
          {
            "name": "equation",
            "value": "A3-A5"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The stationary distribution of the continuous case is zero-mean Gaussian noise with standard deviation σ",
        "content": "In the limit of infinite time,µ→0andΣ→σ²I. Therefore, the stationary distribution of this process is zero-mean Gaussian noise with a standard deviation ofσ.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Continuous Variables"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Discrete variable dynamics are described by Markov jump processes with generator L, where dQt/dt = LQt",
        "content": "The stochastic dynamics of some discrete variableXmay be described by the Markov jump process, dQt dt =LQ t where L is the generator of the dynamics, which is anM×Mmatrix that stores the transition rates between the various states.Q t is a lengthMvector that assigns a probability to each possible stateXmay take at timet.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Discrete Variables"
          },
          {
            "name": "equation",
            "value": "A6"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The transition rates between discrete states are given by L[j, i] = γ(−(M−1)δ j,i + (1−δ j,i)), where δ is the Kronecker delta function",
        "content": "The transition rate from theith state to thejth state is given by the matrix elementL[j, i], which here takes the particular form, L[j, i] =γ(−(M−1)δ j,i + (1−δ j,i)) whereδis used to indicate the Kronecker delta function. Eq. (A7) describes a random process where the probability per unit time to jump between any two states isγ.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Discrete Variables"
          },
          {
            "name": "equation",
            "value": "A7"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The dynamics of Qt can be understood through eigenvalues and eigenvectors of L, given by Lvk = λkvk",
        "content": "Since Eq. (A6) is linear, the dynamics ofQt can be understood entirely via the eigenvalues and eigenvectors ofL, Lvk =λ kvk",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "11"
          },
          {
            "name": "topic",
            "value": "Discrete Variables"
          },
          {
            "name": "equation",
            "value": "A8"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Denoising diffusion models learn to time-reverse a random process that converts data into simple noise through stochastic differential equations or Markov jump processes",
        "content": "Denoising diffusion models try to learn to time-reverse a random process that converts data into simple noise. Here, we will review some details on how these models work to support the analysis in the main text.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Forward processes in diffusion models convert data distribution into noise using different mathematical approaches for continuous and discrete variables",
        "content": "The forward process is a random process that is used to convert the data distribution into noise. This conversion into noise is achieved through a stochastic differential equation in the continuous-variable case and a Markov jump process in the discrete case.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Continuous variable diffusion models use Itô diffusion equation dX(t) = -X(t)dt + √2σdW to transform data into noise",
        "content": "In the continuous case, the typical choice of forward process is the Itô diffusion, dX(t) =−X(t)dt+√2σdW where X(t) is a length N vector representing the state variable at time t, σ is a constant, and dW is a length N vector of independent Wiener processes.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The transition kernel Qt|0(x'|x) defines how probability distributions evolve over time in continuous diffusion models",
        "content": "The transition kernel for a random process defines how the probability distribution evolves in time, Qt|0(x′|x) =P(X(t) =x ′|X(0) =x)",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For Itô diffusion, the transition kernel follows a Gaussian distribution with parameters μ = e^(-t)x and Σ = σ^2 I (1−e^(-2t))",
        "content": "For the case of Eq. (A1) the transition kernel is, Qt+s|s(x′|x)∝e −1/2 (x′−μ)T Σ−1(x′−μ) μ=e −tx Σ =σ2I 1−e −2t this solution can be verified by direct substitution into the corresponding Fokker-Planck equation.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The stationary distribution of continuous diffusion processes is zero-mean Gaussian noise with standard deviation σ in the limit of infinite time",
        "content": "In the limit of infinite time,μ→0andΣ→σ2I. Therefore, the stationary distribution of this process is zero-mean Gaussian noise with a standard deviation ofσ.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Discrete variable diffusion models use Markov jump processes described by dQt/dt = LQt where L is the dynamics generator matrix",
        "content": "The stochastic dynamics of some discrete variableXmay be described by the Markov jump process, dQt/dt =LQ t whereLis the generator of the dynamics, which is anM×Mmatrix that stores the transition rates between the various states.Q t is a lengthMvector that assigns a probability to each possible stateXmay take at timet.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Discrete diffusion transition rates between states are given by L[j,i] = γ(-(M-1)δj,i + (1-δj,i)) where γ is the jump probability rate",
        "content": "The transition rate from theith state to thejth state is given by the matrix elementL[j, i], which here takes the particular form, L[j, i] =γ(−(M−1)δj,i + (1−δj,i)) whereδis used to indicate the Kronecker delta function. Eq. (A7) describes a random process where the probability per unit time to jump between any two states isγ.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Markov jump process dynamics can be analyzed through eigenvalues and eigenvectors of the generator matrix L",
        "content": "Since Eq. (A6) is linear, the dynamics ofQt can be understood entirely via the eigenvalues and eigenvectors ofL, Lvk =λ kvk",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "12"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Discrete diffusion processes have one stationary eigenvector with eigenvalue 0 and M-1 decaying modes with negative eigenvalues λj = -γM",
        "content": "One eigenvector-eigenvalue pair(v0, λ0 = 0)corresponds to the unique stationary state ofL, with all entries ofv0 being equal to some constant (if normalized, thenv0[j] = 1/M for allj). The remaining eigenvectors are decaying modes associated with negative eigenvalues. These additionalM−1 eigenvectors take the form, vj[i] =−δi,0 +δi,j λj =−γM where Eq. (A9) and Eq. (A10) are valid forj∈[1, M−1]. Therefore, all solutions to this MJP decay exponentially to the uniform distribution with rateγM.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "13"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Time evolution of discrete diffusion processes is given by Qt = e^(Lt)Q0 where the matrix exponential is evaluated through diagonalization",
        "content": "The time-evolution ofQis given by the matrix exponential, Qt =e LtQ0.This matrix exponential is evaluated by diagonalizingL, eLt =P eDtP−1 where the columns ofPare theMeigenvectorsv k andDis a diagonal matrix of the eigenvaluesλk.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "13"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Matrix elements of discrete diffusion time evolution follow specific exponential forms involving delta functions and exponential decay terms",
        "content": "Using the solution for the eigenvalues and eigenvectors found above, we can solve for the matrix elements ofeLt, eLt [j, i] =δi,j (1 + (M−1)e −γMt/M) + (1−δi,j)(1−e −γMt/M) Using this solution, we can deduce an exponential form for the matrix elements ofeLt, eLt [j, i] = 1/Z(t) eΓ(t)δi,j Γ(t) = ln((1 + (M−1)e −γt)/(1−e −γt)) Z(t) = M/(1−e −γt)",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "13"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For multiple independent discrete variables, the joint distribution dynamics uses Kronecker products of individual diffusion operators",
        "content": "Now consider a process in which each element of the vector ofNdiscrete variablesXundergoes the dynamics described by Eq. (A6) independently. In that case, the differential equation describing the dynamics of the joint distributionQ t is, dQt/dt = NX k=1 (I1 ⊗ ··· ⊗ Lk ⊗. . . IN )Q t whereI j indicates the identity operator andLj the operator from Eq. (A7) acting on the subspace of thejth discrete variable.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "13"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Time evolution of joint distributions for multiple discrete variables uses the Kronecker product of individual matrix exponentials",
        "content": "The Kronecker product of the matrix exponentials gives the time-evolution of the joint distribution, eLt = NO k=1 eLkt",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix A: Denoising Diffusion Models"
          },
          {
            "name": "page",
            "value": "13"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For continuous diffusion models, the generator L corresponds to the Itô diffusion process and follows the Fokker-Planck equation structure with drift and diffusion terms.",
        "content": "In the case that the forward process is an Itô diffusion,Lis the generator for the corresponding Fokker-Planck equation,\nL=−\nX\ni\n∂\n∂xi\nfi(x, t) +1\n2\nX\ni,j\n∂\n∂xi\n∂\n∂xj\nDij(t)(A29)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A29)"
          },
          {
            "name": "type",
            "value": "continuous_diffusion"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The adjoint operator L† for continuous diffusion models has a specific form with drift and diffusion terms, derived using integration by parts.",
        "content": "Using Eq. (A28) and integration by parts, it can be shown that the adjoint operator is,\nL† =\nX\ni\nfi\n∂\n∂xi\n+ 1\n2\nX\ni,j\nDij\n∂\n∂xi\n∂\n∂xi\n(A30)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A30)"
          },
          {
            "name": "type",
            "value": "adjoint_operator"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The reverse process operator Lrev for continuous diffusion models can be reduced to a form involving a drift vector g and diffusion matrix D.",
        "content": "By directly substituting Eq. (A30) into Eq. (A27) and simplifying,Lrev can be reduced to,\nLrev =\nX\ni\n∂\n∂xi\ngi + 1\n2\nX\ni,j\n∂\n∂xi\n∂\n∂xj\nDij (A31)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A31)"
          },
          {
            "name": "type",
            "value": "reverse_operator"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The drift vector gi in continuous diffusion models is defined as fi(x,t) minus a term involving the gradient of the diffusion matrix weighted by the probability distribution Qt(x).",
        "content": "with the drift vectorg,\ngi(x, t) =fi(x, t)− 1\nQt(x)\nX\nj\n∂\n∂xj\n[Dij(x, t)Qt(x)](A32)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A32)"
          },
          {
            "name": "type",
            "value": "drift_vector"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For sufficiently small time steps Δt, the transition kernel in continuous diffusion models becomes Gaussian with mean μ and covariance matrix Σ defined by the drift and diffusion terms.",
        "content": "If∆tis chosen to be sufficiently small, Eq. (A32) can be linearized and the transition kernel is Gaussian,\nQt|t+∆t(x′|x)∝exp\n(\n−1\n2(x−µ) T Σ−1(x−µ)\n\n)\n(A33)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A33)"
          },
          {
            "name": "type",
            "value": "gaussian_kernel"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The Gaussian mean vector μ is defined as x + Δt gi(x,t) and the covariance matrix Σ is Δt D(t) for continuous diffusion models with small Δt.",
        "content": "µ=x+ ∆t g i(x, t)(A34)\nΣ = ∆t D(t)(A35)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A34)-(A35)"
          },
          {
            "name": "type",
            "value": "gaussian_parameters"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Continuous diffusion models can achieve arbitrary approximation power in the small Δt limit by using neural networks to define the mean vector in Gaussian transitions.",
        "content": "Therefore, one can build a continuous diffusion model with arbitrary approximation power by working in the small∆t\nlimit and approximating the reverse process using a Gaussian distribution with a neural network defining the mean\nvector [1, 2].",
        "attributes": [
          {
            "name": "method",
            "value": "neural_network_approximation"
          },
          {
            "name": "scope",
            "value": "continuous_models"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For discrete diffusion models, the operator L has a tensor product form that guarantees L(x′, x) = 0 for any vectors with Hamming distance greater than one.",
        "content": "In a discrete diffusion model,Lis given by Eq. (A17). This tensor product form forLguarantees thatL(x′, x) = 0\nfor any vectorsx ′ andxthat have a Hamming distance greater than one (which means they have at leastN−1\nmatching elements).",
        "attributes": [
          {
            "name": "equation",
            "value": "(A17)"
          },
          {
            "name": "type",
            "value": "discrete_operator"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Neural networks can be effectively used in discrete diffusion models to approximate ratios of data distributions for neighboring states, enabling arbitrarily good approximations to the reverse process.",
        "content": "As such, in discrete diffusion models, neural networks trained to approximate ratios of the data distribution QT−s (x′)\nQT−s (x) for neighboringx ′ andxcan be used to implement an arbitrarily good approximation to the\nactual reverse process [3].",
        "attributes": [
          {
            "name": "method",
            "value": "neural_network_approximation"
          },
          {
            "name": "scope",
            "value": "discrete_models"
          },
          {
            "name": "reference",
            "value": "[3]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The adjoint operator L† for continuous diffusion models with Itô diffusion processes is mathematically defined as L† = Σᵢ fᵢ ∂/∂xᵢ + ½ Σᵢⱼ Dᵢⱼ ∂²/∂xᵢ∂xⱼ, where D is a symmetric matrix not dependent on x.",
        "content": "Using Eq. (A28) and integration by parts, it can be shown that the adjoint operator is,\nL† = Σᵢ fᵢ ∂/∂xᵢ + ½ Σᵢⱼ Dᵢⱼ ∂/∂xᵢ ∂/∂xᵢ (A30)",
        "attributes": [
          {
            "name": "section",
            "value": "Continuous variables"
          },
          {
            "name": "equation",
            "value": "(A30)"
          },
          {
            "name": "page",
            "value": "15"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "In the small Δt limit, continuous diffusion models can be constructed using Gaussian distributions with neural networks defining the mean vector, providing arbitrary approximation power.",
        "content": "Therefore, one can build a continuous diffusion model with arbitrary approximation power by working in the small Δt limit and approximating the reverse process using a Gaussian distribution with a neural network defining the mean vector [1, 2].",
        "attributes": [
          {
            "name": "section",
            "value": "Continuous variables"
          },
          {
            "name": "reference",
            "value": "[1, 2]"
          },
          {
            "name": "page",
            "value": "15"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The diffusion loss LDN(θ) is formulated as minimizing the distributional distance between the joint distributions of forward process Q₀,...,T and learned reverse process approximation Pθ₀,...,T, which can be simplified to a layerwise form for Markovian processes.",
        "content": "LDN (θ) = D(Q₀,...,T (·)||Pθ₀,...,T (·)) (A36)\nthe Markovian nature of Q can be taken advantage of to simplify Eq. (A36) into a layerwise form,\nLDN (θ) +C=− Σᵗ₌₁ᵀ E_Q(xₜ₋₁,xₜ) [log (Pθ(xₜ₋₁|xₜ)](A37)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A36), (A37)"
          },
          {
            "name": "section",
            "value": "The Diffusion Loss"
          },
          {
            "name": "page",
            "value": "16"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For discrete diffusion models, neural networks trained to approximate ratios of data distributions QT₋ₛ(x')/QT₋ₛ(x) for neighboring states x' and x can implement arbitrarily good approximations to the actual reverse process.",
        "content": "As such, in discrete diffusion models, neural networks trained to approximate ratios of the data distribution QT−s (x′)\nQT−s (x) for neighboringx ′ andxcan be used to implement an arbitrarily good approximation to the actual reverse process [3].",
        "attributes": [
          {
            "name": "reference",
            "value": "[3]"
          },
          {
            "name": "section",
            "value": "Discrete variables"
          },
          {
            "name": "page",
            "value": "15"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Conditional generation tasks like generating MNIST digits with specific class labels can be implemented by concatenating the target data with one-hot encoded labels and treating them as augmented training data for the denoising model.",
        "content": "In principle, this is very simple: we concatenate the target (in our case, the images) and a one-hot encoding of the labels into a contiguous binary vector and treat that whole thing as our training data on which we train the denoising model as described above.",
        "attributes": [
          {
            "name": "application",
            "value": "MNIST digit generation"
          },
          {
            "name": "section",
            "value": "Conditional Generation"
          },
          {
            "name": "page",
            "value": "17"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "The energy landscape of EBM-based approximations to reverse processes becomes simpler as the forward process timestep decreases, making sampling easier despite potential training challenges with noised labels during conditional inference.",
        "content": "However, during conditional inference, the models will have their label nodes clamped to an unnoised labell0, and they may not know how this should influence the generated image (and this problem would only be exacerbated if we clamped to a noised label instead).\nThis issue can be mitigated by using a rateγX when noising image entries in the training data and a different rateγL for noising label entries.",
        "attributes": [
          {
            "name": "recommendation",
            "value": "Use different noise rates for images and labels"
          },
          {
            "name": "section",
            "value": "Conditional Generation"
          },
          {
            "name": "page",
            "value": "17"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Figure 7 demonstrates that as λ increases (representing smaller forward process timesteps), the energy landscape transitions from a strongly bimodal distribution to a simple Gaussian centered at xt, making the distribution much easier to sample from.",
        "content": "The energy landscape is bimodal atλ= 0and gradually becomes distorted towards an unimodal distribution centered atx t asλincreases. This reshaping is intuitive, as shortening the forward process timestep should more strongly constrainx t−1 tox t.",
        "attributes": [
          {
            "name": "visualization",
            "value": "FIG. 7"
          },
          {
            "name": "section",
            "value": "Simplification of the Energy Landscape"
          },
          {
            "name": "page",
            "value": "16"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For energy-based models (EBMs) where Pθ(xₜ₋₁|xₜ) has no closed-form expression, Monte Carlo estimators must be employed to approximate the gradients of the diffusion loss, with the gradient derived as -Σᵗ₌₁ᵀ E_Q(xₜ₋₁,xₜ)[∇θ log Pθ(xₜ₋₁|xₜ)].",
        "content": "∇θLDN (θ) =−\nTX\nt=1\nEQ(xt−1,xt)\n[∇θ log\nPθ(xt−1|xt)\n]\n(A38)",
        "attributes": [
          {
            "name": "equation",
            "value": "(A38)"
          },
          {
            "name": "section",
            "value": "Monte-Carlo gradient estimator"
          },
          {
            "name": "page",
            "value": "16"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The diffusion loss function LDN(θ) is defined as the distributional distance between joint distributions of forward process Q₀,...,T and reverse process Pθ₀,...,T (Equation A36)",
        "content": "LDN (θ) =D\nQ0,...,T (·)||Pθ\n0,...,T (·)\n(A36)",
        "attributes": [
          {
            "name": "equation",
            "value": "A36"
          },
          {
            "name": "section",
            "value": "3"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The Markovian nature of Q allows simplification of the diffusion loss into a layerwise form (Equation A37)",
        "content": "LDN (θ) +C=−\nTX\nt=1\nEQ(xt−1,xt) [log (Pθ(xt−1|xt)](A37)",
        "attributes": [
          {
            "name": "equation",
            "value": "A37"
          },
          {
            "name": "section",
            "value": "3"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "For denoising algorithms in the infinitesimal limit, gradients of LDN can be computed exactly due to the simple form of Pθ",
        "content": "For denoising algorithms that operate in the infinitesimal limit, the simple form of Pθ allows forLDN and its gradients to be computed exactly.",
        "attributes": [
          {
            "name": "section",
            "value": "3"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "When Pθ(xt-1|xt) is an EBM (Energy-Based Model), no closed-form expression exists for ∇θLDN(θ), requiring Monte Carlo estimation",
        "content": "In the case wherePθ\nxt−1|xt\n\nis an EBM, there exists no simple closed-form expression for∇θLDN (θ). In that case, one must employ a Monte Carlo estimator to approximate the gradient.",
        "attributes": [
          {
            "name": "section",
            "value": "3a"
          },
          {
            "name": "model_type",
            "value": "EBM"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The Monte Carlo gradient estimator for diffusion loss is derived as shown in Equation (A38)",
        "content": "∇θLDN (θ) =−\nTX\nt=1\nEQ(xt−1,xt)\n        ∇θ log\n        Pθ(xt−1|xt)\n        \n    (A38)",
        "attributes": [
          {
            "name": "equation",
            "value": "A38"
          },
          {
            "name": "section",
            "value": "3a"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "With EBM parameterization, the gradient of log-likelihood can be simplified using latent variables as shown in Equation (A39)",
        "content": "∇θ log\n        Pθ(xt−1|xt)\n        \n    =E Pθ(xt−1,zt−1|xt)\n        ∇θEm\n        t−1\n        \n    −E Pθ(zt−1|xt−1,xt)\n        ∇θEm\n        t−1\n        \n    (A39)",
        "attributes": [
          {
            "name": "equation",
            "value": "A39"
          },
          {
            "name": "section",
            "value": "3a"
          },
          {
            "name": "model_type",
            "value": "EBM"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Smaller forward process timesteps lead to simpler energy landscapes in EBM-based approximations",
        "content": "As the forward process timestep is made smaller, the energy landscape of the EBM-based approximation to the reverse process becomes simpler.",
        "attributes": [
          {
            "name": "section",
            "value": "4"
          },
          {
            "name": "model_type",
            "value": "EBM"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The marginal energy function example shows Eθ_t-1(xt-1) = (xt-1² - 1)² (Equation A40)",
        "content": "Eθ\nt−1 (xt−1) =\n        x2\n        t−1 −1\n        \n    2\n(A40)",
        "attributes": [
          {
            "name": "equation",
            "value": "A40"
          },
          {
            "name": "section",
            "value": "4"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The forward process energy function for Gaussian diffusion is Ef_t-1(xt-1, xt) = λ((xt-1/xt) - 1)² (Equation A41)",
        "content": "Ef\nt−1 (xt−1, xt) =λ\n        xt−1\n        xt\n        −1\n        \n    2\n(A41)",
        "attributes": [
          {
            "name": "equation",
            "value": "A41"
          },
          {
            "name": "section",
            "value": "4"
          },
          {
            "name": "process_type",
            "value": "Gaussian diffusion"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Parameter λ scales inversely with forward process timestep size, approaching infinity as Δt → 0",
        "content": "The parameterλscales inversely with the size of the forward process timestep; that is,lim\n∆t→0\nλ=∞.",
        "attributes": [
          {
            "name": "section",
            "value": "4"
          },
          {
            "name": "parameter",
            "value": "λ"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The reverse process conditional energy landscape is Eθ_t-1 + Ef_t-1",
        "content": "The reverse process conditional energy landscape is thenEθ\nt−1 +E f\nt−1.",
        "attributes": [
          {
            "name": "section",
            "value": "4"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "The energy landscape transformation from bimodal at λ=0 to unimodal centered at xt as λ increases is intuitive because shorter forward process timesteps more strongly constrain xt-1 to xt",
        "content": "This reshaping is intuitive, as shortening the forward process timestep should more strongly constrainx t−1 tox t.",
        "attributes": [
          {
            "name": "section",
            "value": "4"
          },
          {
            "name": "sentiment",
            "value": "intuitive"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Figure 7 illustrates the effect of λ on the energy landscape, showing the gradual distortion from bimodal to unimodal distribution",
        "content": "The effect ofλon this is shown in Fig. 7.\nThe energy landscape is bimodal atλ= 0and gradually becomes distorted towards an unimodal distribution centered atx t asλincreases.",
        "attributes": [
          {
            "name": "section",
            "value": "4"
          },
          {
            "name": "reference",
            "value": "Figure 7"
          },
          {
            "name": "observation",
            "value": "bimodal to unimodal transition"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Experimental parameter ranges for good conditional generation performance: γL ∈ [0.1,0.3] and γX ∈ [0.7,1.5] for models with 4-12 steps",
        "content": "Experimentally, we observed that settings in the rangesγL ∈[0.1,0.3]andγ X ∈[0.7,1.5](for models with four to 12 steps) yielded good conditional generation performance while avoiding the freezing problem.",
        "attributes": [
          {
            "name": "methodology",
            "value": "experimental"
          },
          {
            "name": "parameter_range",
            "value": "γL: [0.1,0.3], γX: [0.7,1.5]"
          },
          {
            "name": "model_complexity",
            "value": "4-12 steps"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Theoretical equivalence between learned energy functions and true marginal distribution in DTM",
        "content": "If a DTM is trained to match the conditional distribution of the reverse process perfectly, the learned energy functionE θ t−1 is the energy function of the true marginal distribution, that is,Eθ t−1(x)∝logQ(x t−1).",
        "attributes": [
          {
            "name": "theoretical_basis",
            "value": "Bayes' rule"
          },
          {
            "name": "model_type",
            "value": "DTM (Diffusion Transition Model)"
          },
          {
            "name": "mathematical_relationship",
            "value": "Eθ t−1(x)∝logQ(x t−1)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Hardware architecture approach for EBMs using Probabilistic Graphical Models",
        "content": "In this work, we focus on a hardware architecture for EBMs that are naturally expressed as Probabilistic Graphical Models (PGMs). In a PGM-EBM, the random variables involved in the model map to the nodes of a graph, which are connected by edges that indicate dependence between variables.",
        "attributes": [
          {
            "name": "architecture_type",
            "value": "PGM-based EBM hardware"
          },
          {
            "name": "model_representation",
            "value": "Probabilistic Graphical Models"
          },
          {
            "name": "computational_approach",
            "value": "modular sampling"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Energy efficiency benefits of local PGM samplers over Von Neumann architectures",
        "content": "Since the sampling circuits only communicate locally, this type of computer will spend significantly less energy on communication than one built on a Von-Neumann-like architecture, which constantly shuttles data between compute and memory.",
        "attributes": [
          {
            "name": "efficiency_advantage",
            "value": "reduced communication energy"
          },
          {
            "name": "architecture_comparison",
            "value": "local vs Von Neumann"
          },
          {
            "name": "computational_model",
            "value": "compute-in-memory"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Gibbs sampling algorithm definition for PGM joint distribution sampling",
        "content": "Formally, the algorithm that defines this modular sampling procedure for PGMs is called Gibbs sampling. In Gibbs sampling, samples are drawn from the joint distributionp(x1, x2, . . . , xN )by iteratively updating the state of each node conditioned on the current state of its neighbors. For theith node, this means sampling from the distribution, xi[t+ 1]∼p(x i|nb(xi)[t]).",
        "attributes": [
          {
            "name": "algorithm",
            "value": "Gibbs sampling"
          },
          {
            "name": "sampling_method",
            "value": "iterative node conditioning"
          },
          {
            "name": "mathematical_formulation",
            "value": "xi[t+1]∼p(xi|nb(xi)[t])"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Experimental settings with gamma parameters γL ∈[0.1,0.3] and γX ∈[0.7,1.5] for models with 4-12 steps achieved good conditional generation performance while avoiding the freezing problem.",
        "content": "Experimentally, we observed that settings in the rangesγL ∈[0.1,0.3]andγ X ∈[0.7,1.5](for models with four to 12 steps) yielded good conditional generation performance while avoiding the freezing problem.",
        "attributes": [
          {
            "name": "source",
            "value": "Experimental observation"
          },
          {
            "name": "model_range",
            "value": "4-12 steps"
          },
          {
            "name": "gamma_L",
            "value": "[0.1,0.3]"
          },
          {
            "name": "gamma_X",
            "value": "[0.7,1.5]"
          },
          {
            "name": "performance",
            "value": "good conditional generation"
          },
          {
            "name": "issue_avoided",
            "value": "freezing problem"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "When a DTM is trained to perfectly match the conditional distribution of the reverse process, the learned energy function Eθ t−1 corresponds to the energy function of the true marginal distribution, satisfying Eθ t−1(x)∝logQ(x t−1).",
        "content": "If a DTM is trained to match the conditional distribution of the reverse process perfectly, the learned energy functionE θ t−1 is the energy function of the true marginal distribution, that is,Eθ t−1(x)∝logQ(x t−1).",
        "attributes": [
          {
            "name": "theoretical_basis",
            "value": "Bayes' rule application"
          },
          {
            "name": "model_type",
            "value": "DTM (Diffusion Transition Model)"
          },
          {
            "name": "condition",
            "value": "perfect match with true reverse process"
          },
          {
            "name": "mathematical_relationship",
            "value": "Eθ t−1(x)∝logQ(x t−1)"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Probabilistic Graphical Models (PGMs) provide a natural basis for hardware architecture for EBMs because they can be sampled using modular procedures that respect graph structure, enabling efficient hardware implementation with local communication.",
        "content": "PGMs form a natural basis for a hardware architecture because they can be sampled using a modular procedure that respects the graph's structure. Specifically, the state of a PGM can be updated by iteratively stepping through each node of the graph and resampling one variable at a time, using only information about the current node and its immediate neighbors.",
        "attributes": [
          {
            "name": "model_type",
            "value": "PGM-EBM (Probabilistic Graphical Model - Energy-Based Model)"
          },
          {
            "name": "sampling_method",
            "value": "Modular procedure respecting graph structure"
          },
          {
            "name": "advantage",
            "value": "Local communication, efficient hardware implementation"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Compute-in-memory approaches using local PGM sampling circuits can significantly reduce energy consumption compared to Von Neumann architectures by minimizing data communication between compute and memory components.",
        "content": "This localPGMsampler representsa type of compute-in-memory approach, where the stateof the sampling program is spatially distributed throughout the array of sampling circuitry. Since the sampling circuits only communicate locally, this type of computer will spend significantly less energy on communication than one built on a Von Neumann-like architecture, which constantly shuttles data between compute and memory.",
        "attributes": [
          {
            "name": "architecture_type",
            "value": "Compute-in-memory"
          },
          {
            "name": "advantage",
            "value": "Significantly less energy on communication"
          },
          {
            "name": "comparison",
            "value": "vs Von Neumann architecture"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Gibbs sampling is the formal algorithm that defines the modular sampling procedure for PGMs, where samples are drawn from the joint distribution by iteratively updating each node conditioned on its neighbors.",
        "content": "Formally, the algorithm that defines this modular sampling procedure for PGMs is called Gibbs sampling. In Gibbs sampling, samples are drawn from the joint distributionp(x1, x2, . . . , xN )by iteratively updating the state of each node conditioned on the current state of its neighbors. For theith node, this means sampling from the distribution, xi[t+ 1]∼p(x i|nb(xi)[t]).",
        "attributes": [
          {
            "name": "algorithm_name",
            "value": "Gibbs sampling"
          },
          {
            "name": "sampling_procedure",
            "value": "Iterative node updates conditioned on neighbors"
          },
          {
            "name": "mathematical_form",
            "value": "xi[t+ 1]∼p(x i|nb(xi)[t])"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Chromatic Gibbs Sampling enables parallel updates of graph nodes by grouping them into color classes where nodes of the same color do not neighbor each other, allowing simultaneous state updates.",
        "content": "Since each node's update distribution only depends on the state of its neighbors and because nodes of the same color do not neighbor each other, they can all be updated in parallel.",
        "attributes": [
          {
            "name": "section",
            "value": "Chromatic Gibbs Sampling"
          },
          {
            "name": "page",
            "value": "19"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Hardware acceleration of Gibbs sampling requires conditional updates to be efficiently implementable in the target hardware substrate, limiting the types of joint distributions that can be sampled.",
        "content": "The primary constraint around building a hardware device that implements Gibbs sampling is that the conditional update given in Eq. (B1) must be efficiently implementable. Generally, this means that one wants it to take a form that is 'natural' to the hardware substrate being used to build the computer.",
        "attributes": [
          {
            "name": "section",
            "value": "Quadratic EBMs"
          },
          {
            "name": "page",
            "value": "19"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Quadratic EBMs have energy functions that are quadratic in model variables, leading to conditional updates that are efficient to implement in various hardware types through simple sampling circuits biased by linear functions.",
        "content": "Quadratic EBMs have energy functions that are quadratic in the model's variables, which generally leads to conditional updates computed by biasing a simple sampling circuit (Bernoulli, categorical, Gaussian, etc.) with the output of a linear function of the neighbor states and the model parameters. These simple interactions are efficient to implement in various types of hardware.",
        "attributes": [
          {
            "name": "section",
            "value": "Quadratic EBMs"
          },
          {
            "name": "page",
            "value": "19"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Potts models extend Boltzmann machines to k-state variables, requiring softmax sampling circuits for hardware implementation rather than simpler Bernoulli sampling.",
        "content": "Therefore, to build a hardware device that samples from Potts models using Gibbs sampling, one would have to build a softmax sampling circuit parameterized by a linear function of the model weights and neighbor states. Potts model sampling is slightly more complicated than Boltzmann machine sampling, but it is likely possible.",
        "attributes": [
          {
            "name": "section",
            "value": "Potts models"
          },
          {
            "name": "page",
            "value": "20"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Gaussian-Bernoulli EBMs are more challenging to implement in hardware than discrete models because they require handling continuous signals between nodes, which can be done through discrete embedding or analog signaling but with significant overhead.",
        "content": "Hardware implementations of Gaussian-Bernoulli EBMs are more difficult than the strictly discrete models because the signals being passed during conditional sampling of the binary variables are continuous. To pass these continuous values, they must either be embedded into several discrete variables or an analog signaling system must be used. Both of these solutions would incur significant overhead compared to the purely discrete models.",
        "attributes": [
          {
            "name": "section",
            "value": "Gaussian-Bernoulli EBMs"
          },
          {
            "name": "page",
            "value": "20"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The denoising architecture in this work uses separate implementations for forward process and marginal energy functions, with the forward process implemented using simple pairwise couplings and the marginal energy function implemented using a grid-based Boltzmann machine.",
        "content": "The denoising models used in this work exclusively modeled distributions of binary variables. The reverse process energy function (Eq. 7 in the main text) was implemented using a Boltzmann machine. The forward process energy functionEf t−1 was implemented using a simple set of pairwise couplings betweenxt (blue nodes) andxt−1 (green nodes). The marginal energy functionEθ t−1 was implemented using a latent variable model (latent nodes are drawn in orange) with a sparse, local coupling structure.",
        "attributes": [
          {
            "name": "section",
            "value": "A hardware architecture for denoising"
          },
          {
            "name": "page",
            "value": "21"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The marginal energy function implementation uses a grid graph with nearest-neighbor and long-range skip connections, where some nodes represent data variables and others represent latent variables, creating a deep Boltzmann machine with sparse connectivity.",
        "content": "Within the grid, we randomly choose some subset of the nodes to represent the data variablesxt−1. The remaining nodes then implement the latent variablezt−1. The grid is, therefore, a deep Boltzmann machine with a sparse connectivity structure and multiple hidden layers.",
        "attributes": [
          {
            "name": "section",
            "value": "Implementation of the marginal energy function"
          },
          {
            "name": "page",
            "value": "21"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "The authors believe that while more complex than discrete models, Potts model sampling is likely implementable in hardware, though they suggest the experiments focused on simpler Boltzmann machines.",
        "content": "Potts model sampling is slightly more complicated than Boltzmann machine sampling, but it is likely possible.",
        "attributes": [
          {
            "name": "section",
            "value": "Potts models"
          },
          {
            "name": "page",
            "value": "20"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The document provides a comprehensive overview of different hardware-acceleratable EBM architectures, with detailed mathematical formulations for each type and specific implementation considerations.",
        "content": "Here, we will touch on a few other types of quadratic EBM that are more general. Although the experiments in this paper focused on Boltzmann machines, they could be trivially extended to these more expressive classes of distributions.",
        "attributes": [
          {
            "name": "section",
            "value": "Introduction to quadratic EBMs"
          },
          {
            "name": "page",
            "value": "20"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Potts models generalize Boltzmann machines to k-state variables using one-hot encoding where each variable xi has exactly one state xi^m = 1 and others are 0.",
        "content": "xi^m is a one-hot encoding of the state of variable xi, xi^m ∈ {0,1} (B5) ∑^M_m=1 xi^m = 1 (B6) which implies that xi^m = 1 for a single value of m, and is zero otherwise.",
        "attributes": [
          {
            "name": "model_type",
            "value": "Potts"
          },
          {
            "name": "encoding",
            "value": "one-hot"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Potts models have a softmax distribution for individual variables conditioned on their Markov blanket, which reduces to a simpler form when weight matrix J has symmetry J_ij^mn = J_ji^nm.",
        "content": "p(xi^m = 1|mb(xi)) ∝ 1/Z e^{-θ_i^m} (B9) θ_i^m = β(2∑_{j∈mb(xi),n} J_ij^mn x_j^n + h_i^m) (B10)",
        "attributes": [
          {
            "name": "distribution_type",
            "value": "softmax"
          },
          {
            "name": "condition",
            "value": "symmetric weights"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Hardware implementation of Potts models would require building a softmax sampling circuit parameterized by linear functions of model weights and neighbor states, which is more complex than Boltzmann machine sampling but likely feasible.",
        "content": "Therefore, to build a hardware device that samples from Potts models using Gibbs sampling, one would have to build a softmax sampling circuit parameterized by a linear function of the model weights and neighbor states. Potts model sampling is slightly more complicated than Boltzmann machine sampling, but it is likely possible.",
        "attributes": [
          {
            "name": "implementation",
            "value": "hardware"
          },
          {
            "name": "complexity",
            "value": "moderate"
          },
          {
            "name": "feasibility",
            "value": "likely possible"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Gaussian-Bernoulli EBMs extend Boltzmann machines to continuous, binary mixtures and can handle continuous-continuous, binary-binary, and binary-continuous interactions.",
        "content": "Gaussian-Bernoulli EBMs extend Boltzmann machines to continuous, binary mixtures. In general, this type of model can have continuous-continuous, binary-binary, and binary-continuous interactions.",
        "attributes": [
          {
            "name": "model_type",
            "value": "Gaussian-Bernoulli"
          },
          {
            "name": "extensions",
            "value": "continuous, binary mixtures"
          },
          {
            "name": "interaction_types",
            "value": "three types"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Quadratic EBMs beyond Boltzmann machines could trivially extend the experiments in this paper, though the focus was specifically on Boltzmann machines.",
        "content": "Although the experiments in this paper focused on Boltzmann machines, they could be trivially extended to these more expressive classes of distributions.",
        "attributes": [
          {
            "name": "scope",
            "value": "research experiments"
          },
          {
            "name": "extendibility",
            "value": "trivial"
          },
          {
            "name": "focus",
            "value": "Boltzmann machines"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The hardware denoising architecture uses a grid subdivided into visible nodes (representing variables x_{t-1}) and latent nodes (representing z_{t-1}), with blue nodes carrying values from previous denoising steps that remain fixed during Gibbs sampling.",
        "content": "A graph for hardware denoising. The grid is subdivided at random into visible (green) nodes, representing the variablesx t−1, and latent (orange) nodes, representingzt−1. Each visible nodext−1 j is coupled to a (blue) node carrying the value from the previous step of denoisingxt j (note that these blue nodes stay fixed throughout the Gibbs sampling).",
        "attributes": [
          {
            "name": "source",
            "value": "Fig. 9b description"
          },
          {
            "name": "section",
            "value": "Hardware Architecture"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The variational approximation to the reverse process conditional uses an energy function that combines the forward process energy function and the marginal energy function, implemented by adding nodes to the grid that are connected pairwise to data nodes.",
        "content": "As explicitly stated in Eq. 7 of the article, our variational approximation to the reverse process conditional has an energy function that is the sum of the forward process energy function and the marginal energy function. Physically, this corresponds to adding nodes to our grid that implementxt, which are connected pairwise to the data nodes implementingx t−1 via the coupling defined in Eq. (C1).",
        "attributes": [
          {
            "name": "source",
            "value": "Section after Table I"
          },
          {
            "name": "equation",
            "value": "Eq. 7"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The hardware architecture uses a random number generator (RNG) circuit that produces random bits at approximately 10MHz using approximately 350aJ of energy per bit.",
        "content": "We provide experimental measurements of our novel RNG circuitry in the main text, which establish that random bits can be produced at a rate ofτ−1 rng ≈10MHz using∼350aJ of energy per bit.",
        "attributes": [
          {
            "name": "source",
            "value": "Appendix D: Energetic analysis"
          },
          {
            "name": "measurement",
            "value": "Experimental"
          },
          {
            "name": "performance_metric",
            "value": "10MHz, 350aJ/bit"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The sampling cell design utilizes a linear analog circuit to combine neighboring states and model weights, producing a control voltage for an RNG that generates biased random bits based on a sigmoidal function of the control voltage.",
        "content": "The design considered here utilizes a linear analog circuit to combine the neighboring states and model weights, producing a control voltage for an RNG. This RNG then produces a random bit that is biased by a sigmoidal function of the control voltage. This updated state is then broadcast back to the neighbors.",
        "attributes": [
          {
            "name": "source",
            "value": "Appendix D: Sampling cell design"
          },
          {
            "name": "component",
            "value": "Linear analog circuit + RNG"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The document shows different connectivity patterns for various graph degrees (G8, G12, G16, G20, G24) with specific edge connections, indicating systematic scaling of the hardware architecture.",
        "content": "Pattern Connectivity\nG8 (0,1),(4,1)\nG12 (0,1),(4,1),(9,10)\nG16 (0,1),(4,1),(8,7),(14,9)\nG20 (0,1),(4,1),(3,6),(8,7),(14,9)\nG24 (0,1),(1,2),(4,1),(3,6),(8,7),(14,9)",
        "attributes": [
          {
            "name": "source",
            "value": "Table I"
          },
          {
            "name": "type",
            "value": "Connectivity patterns"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The RNG circuit output exhibits random wandering between high and low states, with the bias being a sigmoidal function of the control voltage, which is described as a critical feature for the system's operation.",
        "content": "Fig. 15 (a) shows an output voltage waveform from the RNG circuit. It wanders randomly between high and low states. Critically, the bias of the RNG circuit (the probability of finding it in the high or low state) is a sigmoidal function of its control voltage, which allows",
        "attributes": [
          {
            "name": "source",
            "value": "End of page 21"
          },
          {
            "name": "reference",
            "value": "Fig. 15(a)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Hardware denoising architecture uses a grid-based connectivity pattern that is repeated for every cell in the grid.",
        "content": "Our hardware denoising architecture (a)An example of a possible connectivity pattern as specified in Table. I. For clarity, the pattern is illustrated as applied to a single cell; however, in reality, the pattern is repeated for every cell in the grid.",
        "attributes": [
          {
            "name": "section",
            "value": "Hardware Architecture Overview"
          },
          {
            "name": "reference",
            "value": "Fig. 9(a)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The architecture implements visible nodes representing x_{t-1} and latent nodes representing z_{t-1}, with coupling between data nodes and previous denoising states.",
        "content": "(b)A graph for hardware denoising. The grid is subdivided at random into visible (green) nodes, representing the variables x_{t-1}, and latent (orange) nodes, representing z_{t-1}. Each visible node x_{t-1}_j is coupled to a (blue) node carrying the value from the previous step of denoising x^j_t (note that these blue nodes stay fixed throughout the Gibbs sampling).",
        "attributes": [
          {
            "name": "section",
            "value": "Hardware Architecture Overview"
          },
          {
            "name": "reference",
            "value": "Fig. 9(b)"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The variational approximation energy function is the sum of forward process energy function and marginal energy function.",
        "content": "As explicitly stated in Eq. 7 of the article, our variational approximation to the reverse process conditional has an energy function that is the sum of the forward process energy function and the marginal energy function.",
        "attributes": [
          {
            "name": "section",
            "value": "Variational Approximation"
          },
          {
            "name": "reference",
            "value": "Eq. 7"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The RNG design uses only transistors and integrates with traditional circuit components for large-scale sampling systems.",
        "content": "Our RNG design uses only transistors and can integrate tightly with other traditional circuit components on a chip to implement a large-scale sampling system. Since there are no exotic components involved that introduce unknown integration barriers, it is straightforward to build a simple physical model to predict how this device utilizes energy.",
        "attributes": [
          {
            "name": "section",
            "value": "RNG Design"
          },
          {
            "name": "confidence",
            "value": "high"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Unit sampling cells implement Boltzmann machine conditional updates as specified in Equation 11.",
        "content": "The performance of the device can be understood by analyzing the unit sampling cell that lives on each node of the PGM implemented by the hardware. The function of this cell is to implement the Boltzmann machine conditional update, as given in Eq. 11 in the main text.",
        "attributes": [
          {
            "name": "section",
            "value": "Unit Sampling Cell"
          },
          {
            "name": "reference",
            "value": "Eq. 11"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The sampling cell design uses linear analog circuit to combine neighboring states and model weights, producing control voltage for RNG.",
        "content": "There are many possible designs for the sampling cell. The design considered here utilizes a linear analog circuit to combine the neighboring states and model weights, producing a control voltage for an RNG. This RNG then produces a random bit that is biased by a sigmoidal function of the control voltage.",
        "attributes": [
          {
            "name": "section",
            "value": "Sampling Cell Design"
          },
          {
            "name": "design_type",
            "value": "linear analog circuit"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Experimental measurements show the RNG circuit produces random bits at approximately 10MHz rate with ~350aJ energy per bit.",
        "content": "We provide experimental measurements of our novel RNG circuitry in the main text, which establish that random bits can be produced at a rate of τ^{-1}_{rng} ≈10MHz using ~350aJ of energy per bit.",
        "attributes": [
          {
            "name": "section",
            "value": "Experimental Results"
          },
          {
            "name": "measurement_type",
            "value": "energy consumption"
          },
          {
            "name": "confidence",
            "value": "experimental"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The RNG circuit output voltage waveform wanders randomly between high and low states, with bias being a sigmoidal function of control voltage.",
        "content": "Fig. 15 (a) shows an output voltage waveform from the RNG circuit. It wanders randomly between high and low states. Critically, the bias of the RNG circuit (the probability of finding it in the high or low state) is a sigmoidal function of its control voltage, which allows",
        "attributes": [
          {
            "name": "section",
            "value": "RNG Circuit Behavior"
          },
          {
            "name": "reference",
            "value": "Fig. 15(a)"
          },
          {
            "name": "circuit_behavior",
            "value": "sigmoidal bias function"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "Document includes specific connectivity patterns for graphs of various degrees, with detailed edge mappings for different graph sizes.",
        "content": "Pattern Connectivity\nG8 (0,1),(4,1)\nG12 (0,1),(4,1),(9,10)\nG16 (0,1),(4,1),(8,7),(14,9)\nG20 (0,1),(4,1),(3,6),(8,7),(14,9)\nG24 (0,1),(1,2),(4,1),(3,6),(8,7),(14,9)\nTABLE I. Edges (ordered pairs) associated with graphs of various degrees.",
        "attributes": [
          {
            "name": "section",
            "value": "Connectivity Patterns"
          },
          {
            "name": "table_reference",
            "value": "Table I"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The sampling cell design supports initialization and readout operations (get/set state operations) in addition to the main Boltzmann update function.",
        "content": "The cell must also support initialization and readout (get/set state operations). A schematic of a unit cell is shown in Fig. 8.",
        "attributes": [
          {
            "name": "section",
            "value": "Sampling Cell Operations"
          },
          {
            "name": "reference",
            "value": "Fig. 8"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The fixed point voltage V∞b is calculated using conductance-weighted sum of bias voltages, with total conductance GΣ being the sum of all individual conductances Gj.",
        "content": "V ∞\nb =\nn+2X\nj=1\nGj\nGΣ\nVddyj (D4)\nwhere the total conductanceGΣ is,\nGΣ =\nn+2X\nj=1\nGj (D5)",
        "attributes": [
          {
            "name": "equation",
            "value": "D4-D5"
          },
          {
            "name": "type",
            "value": "circuit_parameter"
          },
          {
            "name": "voltage",
            "value": "V∞b"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The RNG bias curve follows a sigmoid function that implements model weights through conductance-weighted input terms and a bias term.",
        "content": "P(x i = 1) =σ\n(Vb\nVs\n−ϕ\n)\ninserting Eq. (D4) and expanding the term inside the sigmoid,\nVb\nVs\n−ϕ=\nnX\nj=1\nGj\nGΣ\nVdd\nVs\n(xj ⊕s j) +\n[Gn+1\nGΣ\nVdd\nVs\n−ϕ\n]\n(D7)\nby comparison to the Boltzmann machine conditional, we can see that the first term implements the model weights\n(which can be positive or negative given an appropriate setting of the sign bitsj), and the second term implements\na bias.",
        "attributes": [
          {
            "name": "equation",
            "value": "D6-D7"
          },
          {
            "name": "type",
            "value": "bias_curve"
          },
          {
            "name": "function",
            "value": "sigmoid"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Static power consumption in the circuit is proportional to voltage squared and depends on an input-dependent constant γ that represents conductance-weighted sum of yj values.",
        "content": "The static power drawn by this circuit can be written in the form,\nP∞ = C\nτbias\nV 2\ndd(1−γ)γ(D8)\nwhere0≤γ≤1is the input-dependent constant,\nγ=\nn+2X\nj=1\nGj\nGΣ\nyj (D9)",
        "attributes": [
          {
            "name": "equation",
            "value": "D8-D9"
          },
          {
            "name": "type",
            "value": "power_consumption"
          },
          {
            "name": "parameter",
            "value": "γ"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Energy consumed by the bias circuit is primarily due to static power dissipation, with maximum energy consumption occurring when γ = 1/2.",
        "content": "This fixed point must be held while the noise generator relaxes, which means that the energetic cost of the biasing\ncircuit is approximately,\nEbias ≈P∞τrng\n=C τrng\nτbias\nV 2\ndd(1−γ)γ (D10)\nThis is maximized forγ= 1\n2 .\nTo avoid slowing down the sampling machine,τrng\nτbias\n≫1. As such, ignoring the energy spent charging the capacitor\n∼ 1\n2 CV 2\nb will not significantly affect the results, and the approximation made in Eq. (D10) should be accurate. The\nenergy consumed by the bias circuit is primarily due to static power dissipation.",
        "attributes": [
          {
            "name": "equation",
            "value": "D10"
          },
          {
            "name": "type",
            "value": "energy_consumption"
          },
          {
            "name": "maximum",
            "value": "γ=1/2"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Communication energy between neighboring cells is determined by wire capacitance and signaling voltage, with energy required being proportional to Cwire × Vsig².",
        "content": "In most electronic devices, signals are communicated by charging and discharging wires. Charging a wire\nrequires the energy input,\nEcharge = 1\n2CwireV 2\nsig (D11)\nwhereC wire is the capacitance associated with the wire, which grows with its length, andVsig is the signaling voltage\nlevel.",
        "attributes": [
          {
            "name": "equation",
            "value": "D11"
          },
          {
            "name": "type",
            "value": "communication_energy"
          },
          {
            "name": "dependency",
            "value": "wire_length"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Wire capacitance per unit length in the process is approximately 350aF/µm, with total capacitance for node connections calculated using geometric components of connection rules.",
        "content": "Given the connectivity patterns shown in table I, it is possible to estimate the total capacitanceCn associated\nwith the wire connecting a node to all of its neighbors,\nCn = 4ηℓ\nX\ni\nq\na2\ni +b 2\ni (D12)\nwhereℓ≈6µmis the sampling cell side length, andη≈350aF/µmis the wire capacitance per unit length in our process, see Fig. 11 (b).ai andb i are thexandycomponents of thei th connection rule, as described in section C.2.",
        "attributes": [
          {
            "name": "equation",
            "value": "D12"
          },
          {
            "name": "type",
            "value": "capacitance_estimation"
          },
          {
            "name": "parameters",
            "value": "ℓ=6µm, η=350aF/µm"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Global clock distribution requires signal transmission over long wires with large capacitance, with a simple clock distribution scheme requiring total wire length proportional to the number of rows times row length.",
        "content": "Several systems on the chip require signals to be transmitted from some central location out to the individual\nsampling cells. This communication involves sending signals over long wires with a large capacitance, which is\nenergetically expensive. Here, the cost of this global communication will be taken into consideration.\na. Clocking\nAlthough it is possible in principle to implement Gibbs sampling completely asynchronously, in practice, it is more\nefficient to implement standard chromatic Gibbs sampling with a global clock. A global clock requires a signal to\nbe distributed from a central clock circuit to every sampling cell on the chip. This signal distribution is typically\naccomplished using a clock tree, a branching circuit designed to minimize timing inconsistencies between disparate\ncircuit elements.\nTo simplify the analysis, we will consider a simple clock distribution scheme in which the clock is distributed by\nlines that run the entire length of each row in the grid. The total length of the wires used for clock distribution in\nthis scheme is,\nLclock =N L(D13)",
        "attributes": [
          {
            "name": "equation",
            "value": "D13"
          },
          {
            "name": "type",
            "value": "clock_distribution"
          },
          {
            "name": "scheme",
            "value": "row-based"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The circuit uses resistors to implement multiply-accumulate operations required by conditional update rules, with resistor conductance tuning needed for specific weight and bias sets.",
        "content": "Section D.1 discusses a simple circuit that uses resistors to implement the multiply-accumulate required by the conditional update rule. Key to this is being able to tune the conductance of the resistors to implement specific sets of weights and biases (see Eq. (D7)).",
        "attributes": [
          {
            "name": "section",
            "value": "D.1"
          },
          {
            "name": "component",
            "value": "resistor circuit"
          },
          {
            "name": "function",
            "value": "multiply-accumulate"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Memory on-chip storage of model parameters is required for implementing resistor tunability, with writing operations consuming significantly more energy than state maintenance.",
        "content": "Practically, implementing this tunability requires that the model parameters be stored in memory somewhere on the chip. Writing to and maintaining these memories costs energy. Writing to the memories uses much more energy than maintaining the state.",
        "attributes": [
          {
            "name": "section",
            "value": "D.1"
          },
          {
            "name": "memory_type",
            "value": "on-chip"
          },
          {
            "name": "energy_comparison",
            "value": "writing >> maintenance"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Infrequent memory programming (program once, run many sampling programs) makes maintenance energy cost dominant and minimal compared to sampling cell costs.",
        "content": "However, if writes are infrequent (program the device once and then run many sampling programs on it before writing again), then the overall cost of the memory is dominated by maintenance. Luckily, most conventional memories are specifically designed to consume as little energy as possible when not being accessed.",
        "attributes": [
          {
            "name": "section",
            "value": "D.1"
          },
          {
            "name": "usage_pattern",
            "value": "infrequent writes"
          },
          {
            "name": "cost_dominance",
            "value": "maintenance dominates"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Memory maintenance energy costs are negligible at system level compared to sampling cell costs and do not significantly affect overall energy outcomes.",
        "content": "As such, in practice, the cost of memory maintenance is small compared to the other costs associated with the sampling cells and does not significantly change the outcome shown in Fig. 12.",
        "attributes": [
          {
            "name": "section",
            "value": "D.1"
          },
          {
            "name": "impact_level",
            "value": "system level"
          },
          {
            "name": "cost_significance",
            "value": "negligible"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Off-chip communication costs depend heavily on system integration tightness and were analyzed only at chip edge as a conservative lower bound.",
        "content": "The cost of this communication depends strongly on the tightness of integration between the two systems and is impossible to reason about at an abstract level. As such, the analysis of communication here (as in Section D3b) was limited to the cost of getting bits out to the edge of our chip, which is a lower bound on the actual cost.",
        "attributes": [
          {
            "name": "section",
            "value": "D.2"
          },
          {
            "name": "analysis_scope",
            "value": "chip edge only"
          },
          {
            "name": "bound_type",
            "value": "lower bound"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Detailed PCB-mediated chip communication analysis shows system-level results remain unchanged due to long-running sampling programs.",
        "content": "However, we have found that a more detailed analysis, which includes the cost of communication between two chips mediated by a PCB, does not significantly change the results at the system level.",
        "attributes": [
          {
            "name": "section",
            "value": "D.2"
          },
          {
            "name": "analysis_detail",
            "value": "PCB-mediated"
          },
          {
            "name": "result_impact",
            "value": "no significant change"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Sampling programs run many iterations before mixing and sending results, causing discrepancy between Esamp and E_init + E_read metrics.",
        "content": "The fundamental reason for this is that sampling programs for complex models run for many iterations before mixing and sending the results back to the outside world. This is reflected in the discrepancy betweenEsamp andE init +E read found in section D.4.",
        "attributes": [
          {
            "name": "section",
            "value": "D.2"
          },
          {
            "name": "program_behavior",
            "value": "many iterations"
          },
          {
            "name": "metric_discrepancy",
            "value": "Esamp vs E_init + E_read"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Architectural heterogeneity enables sharing of supporting circuitry among sampling cells, dramatically reducing per-cell energy costs.",
        "content": "Due to the heterogeneity of our architecture, it is possible to share most of the supporting circuitry among many sampling cells, which dramatically reduces the per-cell cost.",
        "attributes": [
          {
            "name": "section",
            "value": "D.3"
          },
          {
            "name": "architecture_type",
            "value": "heterogeneous"
          },
          {
            "name": "cost_reduction",
            "value": "dramatic per-cell reduction"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Supporting circuitry energy costs are insignificant at system level due to architectural sharing capabilities.",
        "content": "As such, the energy cost of the supporting circuitry is not significant at the system level.",
        "attributes": [
          {
            "name": "section",
            "value": "D.3"
          },
          {
            "name": "cost_level",
            "value": "system level"
          },
          {
            "name": "significance",
            "value": "insignificant"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "NVIDIA A100 GPU experiments used Zeus tool for empirical energy measurement and FLOPS-based theoretical estimation.",
        "content": "All experiments shown in Fig. 1 in the article were conducted on NVIDIA A100 GPUs. The empirical estimates of energy were conducted by drawing a batch of samples from the model and measuring the GPU energy consumption and time via Zeus [5]. The theoretical energy estimates were derived by taking the number of model FLOPS (via JAX and PyTorch's internal estimators) and plugging them into the NVIDIA GPU specifications (19.5 TFLOPS for Float32 and 400W).",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix E"
          },
          {
            "name": "hardware",
            "value": "NVIDIA A100 GPU"
          },
          {
            "name": "measurement_method",
            "value": "Zeus + FLOPS estimation"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Empirical and theoretical GPU energy measurements show good alignment, validating the theoretical estimation approach.",
        "content": "The empirical measurements are compared to theoretical estimates for the VAE in Table II, and the empirical measurements show good alignment with the theoretical.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix E"
          },
          {
            "name": "model_type",
            "value": "VAE"
          },
          {
            "name": "alignment",
            "value": "good empirical-theoretical"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "GPU energy efficiency data shows empirical measurements consistently higher than theoretical estimates but within reasonable range.",
        "content": "FID Empirical Efficiency Theoretical Efficiency\n30.5 6.1×10 −5 2.3×10 −5\n27.4 1.5×10 −4 0.4×10 −4\n17.9 2.5×10 −3 1.7×10 −3",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix E"
          },
          {
            "name": "data_type",
            "value": "energy efficiency"
          },
          {
            "name": "units",
            "value": "joules per sample"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The research focuses on relative energy consumption scales rather than state-of-the-art performance, using ResNet and UNet style architectures consistent with literature values.",
        "content": "The models were derived from available implementations and are based on ResNet [6] and UNet [7] style architectures. Their FID performance is consistent with published literature values [8–10]. The goal is not to achieve state of the art performance, but to represent the relative scales of energy consumption of the algorithms.",
        "attributes": [
          {
            "name": "section",
            "value": "Appendix E"
          },
          {
            "name": "architecture_style",
            "value": "ResNet/UNet"
          },
          {
            "name": "research_focus",
            "value": "relative energy scales"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Diffusion models are substantially less energy-efficient than VAEs due to requiring multiple runs (dozens to thousands of times) to generate a single sample, whereas VAE decoders typically run once.",
        "content": "The reader may be surprised to see that the diffusion model is substantially less energy-efficient than the VAE given the relative dominance in image generation. However, two points should be kept in mind. First, while VAE remains a semi-competitive model for these smaller datasets, this quickly breaks down. On larger datasets, a FID performance gap usually exists between diffusion models and VAEs. Second, these diffusion models (based on the original DDPM [2]) have performance that can depend on the number of diffusion time steps. So, not only is the UNet model often larger than a VAE decoder, but it also must be run dozens to thousands of times in order to generate a single sample (thus resulting in multiple orders of magnitude more energy required). Modern improvements, such as distillation [11], may move the diffusion model energy efficiency closer to the VAE's.",
        "attributes": [
          {
            "name": "model_comparison",
            "value": "diffusion_models_vs_vae"
          },
          {
            "name": "energy_consumption",
            "value": "multiple_orders_of_magnitude"
          },
          {
            "name": "technical_detail",
            "value": "UNet_vs_VAE_decoder"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "Modern improvements like distillation may help diffusion models achieve energy efficiency closer to VAE levels.",
        "content": "Modern improvements, such as distillation [11], may move the diffusion model energy efficiency closer to the VAE's.",
        "attributes": [
          {
            "name": "improvement_suggestion",
            "value": "distillation"
          },
          {
            "name": "optimistic_outlook",
            "value": "positive"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The total correlation penalty gradients can be computed using the same samples used to estimate the gradient of the usual loss in training.",
        "content": "The total correlation penalty is a convenient choice in this context because its gradients can be computed using the same samples used to estimate the gradient of the usual loss used in training,∇θLDN .",
        "attributes": [
          {
            "name": "computational_efficiency",
            "value": "sample_reuse"
          },
          {
            "name": "optimization",
            "value": "gradient_computation"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The Adaptive Correlation Penalty (ACP) scheme dynamically adjusts λt based on an estimate of the model's current mixing time using autocorrelation of the Gibbs sampling chain as a proxy.",
        "content": "To address this, we employ an Adaptive Correlation Penalty (ACP) scheme that dynamically adjustsλt based on an estimate of the model's current mixing time. We use the autocorrelation of the Gibbs sampling chain,rt yy, as a proxy for mixing, as described in Section H and the main text, Eq. 18.",
        "attributes": [
          {
            "name": "method_name",
            "value": "ACP"
          },
          {
            "name": "control_parameter",
            "value": "λt"
          },
          {
            "name": "proxy_metric",
            "value": "autocorrelation"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "The ACP algorithm uses a simple layerwise procedure with four steps: estimate current autocorrelation, set minimum lambda value, update lambda based on autocorrelation comparison, and ensure lambda doesn't go below minimum.",
        "content": "A simple layerwise procedure is used for this control. The inputs to the algorithm are the initial values ofλt, a target autocorrelation thresholdεACP (e.g.,0.03), an update factorδ ACP (e.g.,0.2) and a lower limitλmin t (e.g.,0.0001).\nAt the end of each training epochm:\n1. Estimate the current autocorrelationat m =r t yy[K]. This estimate can be done by running a longer Gibbs chain periodically and calculating the empirical autocorrelation from the samples.\n2. Setλ′ t =max(λmin t , λ(m) t )to avoid getting stuck at 0.\n3. Updateλ t for the next epoch (m+ 1) based onat m and the previous valueat m−1 (ifm >0):\n•Ifa t m < εACP: The chain mixes sufficiently fast; reduce the penalty slightly.\nλ(m+1) t ←(1−δ ACP)λ′ t\n•Else ifa t m ≥εACP anda t m ≤a t m−1 (orm= 0): Mixing is slow but not worsening (or baseline); keep the penalty strength.\nλ(m+1) t ←λ′ t\n•Else (a t m > εACP anda t m > at m−1): Mixing is slow and worsening; increase the penalty.\nλ(m+1) t ←(1 +δ ACP)λ′ t\n4. If the proposed valueλ(m+1) t < λmin t , then setλ(m+1) t ←0.",
        "attributes": [
          {
            "name": "algorithm_complexity",
            "value": "layerwise_procedure"
          },
          {
            "name": "update_logic",
            "value": "conditional_adjustments"
          }
        ]
      },
      {
        "type": "opinion",
        "insight": "The simple feedback mechanism of the ACP algorithm works effectively and is vastly more efficient than manual hyperparameter searches.",
        "content": "Our experiments indicate that this simple feedback mechanism works effectively. Whileλt and the autocorrelation at m might exhibit some damped oscillations for several epochs before stabilizing this automated procedure is vastly more efficient than performing manual hyperparameter searches forλt for each of theTmodels.",
        "attributes": [
          {
            "name": "efficiency_claim",
            "value": "vastly_more_efficient"
          },
          {
            "name": "automation_benefit",
            "value": "reduced_manual_tuning"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Training is relatively insensitive to the exact choice of εACP within [0.02,0.1] and δACP within [0.1,0.3], and λmin t within [0.001,0.00001].",
        "content": "Training is relatively insensitive to the exact choice ofεACP within a reasonable range (e.g.,[0.02,0.1]) andδ ACP (e.g.,[0.1,0.3]). Assuming that over the course of training theλ t parameter settles around some valueλ ∗ t , one should aim for the lower bound parameterλmin t to be smaller than1 2 λ∗ t , while making sure that the ramp-up time log(λ∗ t )−log(λmin t ) log(1+δACP) remains small. Settings ofλmin t in the range[0.001,0.00001]all produced largely the same result, the only difference being that values on the lower end of that range led to a larger amplitude in oscillations ofλt andat m, but training eventually settled for all values.",
        "attributes": [
          {
            "name": "hyperparameter_robustness",
            "value": "wide_range_tolerance"
          },
          {
            "name": "parameter_ranges",
            "value": "εACP[0.02,0.1], δACP[0.1,0.3], λmin[0.001,0.00001]"
          }
        ]
      },
      {
        "type": "fact",
        "insight": "Continuous data can be embedded into binary variables by representing a k-state categorical variable Xi using the sum of k binary variables Zki.",
        "content": "In some of our experiments, we needed to embed continuous data into binary variables. We chose to do this by representing ak-state categorical variableXi using the sumkbinary variablesZ k i , Xi = KiX k=1 Z(k) i (G1) whereZ (k) i ∈ {0,1}.",
        "attributes": [
          {
            "name": "embedding_technique",
            "value": "binary_representation"
          },
          {
            "name": "variable_type",
            "value": "categorical_to_binary"
          }
        ]
      },
      {
        "type": "comment",
        "insight": "The document appears to be from a machine learning research paper focusing on diffusion models, VAEs, and novel regularization techniques like the Adaptive Correlation Penalty scheme.",
        "content": "The reader may be surprised to see that the diffusion model is substantially less energy-efficient than the VAE given the relative dominance in image generation. However, two points should be kept in mind. First, while VAE remains a semi-competitive model for these smaller datasets, this quickly breaks down. On larger datasets, a FID performance gap usually exists between diffusion models and VAEs. Second, these diffusion models (based on the original DDPM [2]) have performance that can depend on the number of diffusion time steps. So, not only is the UNet model often larger than a VAE decoder, but it also must be run dozens to thousands of times in order to generate a single sample (thus resulting in multiple orders of magnitude more energy required). Modern improvements, such as distillation [11], may move the diffusion model energy efficiency closer to the VAE's.",
        "attributes": [
          {
            "name": "document_type",
            "value": "research_paper"
          },
          {
            "name": "field",
            "value": "machine_learning"
          },
          {
            "name": "focus_area",
            "value": "diffusion_models_and_regularization"
          }
        ]
      }
    ]
  }
}