<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>tony's blog</title><link href="https://NLTuan.github.io/" rel="alternate"></link><link href="https://NLTuan.github.io/feeds/all.atom.xml" rel="self"></link><id>https://NLTuan.github.io/</id><updated>2026-04-21T00:00:00-04:00</updated><entry><title>Diffusion Transformer from scratch</title><link href="https://NLTuan.github.io/diffusion-transformer-from-scratch.html" rel="alternate"></link><published>2026-04-21T00:00:00-04:00</published><updated>2026-04-21T00:00:00-04:00</updated><author><name>Le Tuan Huy (Tony) Nguyen</name></author><id>tag:nltuan.github.io,2026-04-21:/diffusion-transformer-from-scratch.html</id><summary type="html">&lt;p&gt;Custom implementation of Diffusion Transformer from scratch&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Diffusion Transformer from scratch&lt;/h1&gt;
&lt;p&gt;In my journey of understanding the state of the art in end-to-end robotic policy training, I found a paradigm that many research labs have converged on: pivoting towards diffusion-based models to model the action chunks that robots perform (DIFFUSION POLICY REF, PI 0 REF). Many of you reading are probably familiar with diffusion models in the context of image generation, and robot training policies borrow these same architectures. Initially, diffusion models were CNN-based, but with the rise of transformers, they have found their way in as well: enter the Diffusion Transformer. I have bestowed upon myself the task of reimplementing the paper that introduced this architecture: Scalable Diffusion Models with Transformers (REF FOR DIFFUSION TRANSFORMER). In this paper, the authors showed that Diffusion Transformers scale favorably and outperformed prior diffusion models on class-conditional ImageNet generation. For this implementation, I will be staying within the image generation space. (Pi 0 implementation coming soon!)&lt;/p&gt;
&lt;p&gt;The implementation of the Diffusion Transformer will be split into its constituent parts as PyTorch modules and then assembled into one DiT module. The parts are: Patchify, Conditioning Embeddings, the DiT Block with adaLN-Zero, and the output projection layers.&lt;/p&gt;
&lt;p&gt;However, the implementation does not include training the VAE that compresses the original image into a latent representation; the paper instead uses a frozen pretrained VAE. This VAE lets the Diffusion Transformer work in a smaller-dimensional space, which reduces computational cost and makes training more efficient.&lt;/p&gt;
&lt;h1&gt;Patchify&lt;/h1&gt;
&lt;p&gt;In order for Transformers to operate on images, we need to reshape the images into the form Transformers love: a sequence of tokens with an embedding/hidden dimension, i.e. (batch, seq_len, dim). Images, however, come as (batch, channels, height, width). This is handled by a &lt;code&gt;Conv2d&lt;/code&gt; layer followed by a &lt;code&gt;reshape&lt;/code&gt;. The &lt;code&gt;Conv2d&lt;/code&gt; layer groups squares of pixels into patches, which get treated like tokens later on in the attention operations. It also allows for variable patch sizes: bigger patches yield fewer "tokens", reducing compute complexity. For example, given a batched latent of shape (B, 4, 32, 32) and a patch size of 2, passing the latent through &lt;code&gt;Conv2d(4, 1152, patch_size, patch_size)&lt;/code&gt; yields an output of shape (B, 1152, 16, 16). This is not the desired shape yet, so a reshape must be applied: &lt;code&gt;einops.rearrange(latent, 'b d h w -&amp;gt; b (h w) d')&lt;/code&gt; flattens the height and width dimensions and moves the hidden dimension to the end, yielding something similar to an embedded token sequence. Now this transformed latent representation can undergo attention operations, with image patches performing bidirectional attention on one another.&lt;/p&gt;
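&lt;p&gt;A minimal sketch of this patchify step in PyTorch (the class name and defaults are my own, chosen to match the shapes above):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class Patchify(nn.Module):
    """Turn a latent image into a sequence of patch tokens. A sketch, not the paper's code."""
    def __init__(self, in_channels=4, dim=1152, patch_size=2):
        super().__init__()
        # non-overlapping conv: each patch_size x patch_size square becomes one token
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, C, H, W) -> (B, dim, H/p, W/p)
        x = self.proj(x)
        # equivalent of rearrange 'b d h w -> b (h w) d': flatten space, move dim last
        return x.flatten(2).transpose(1, 2)

tokens = Patchify()(torch.randn(2, 4, 32, 32))  # shape (2, 256, 1152)
```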
&lt;h1&gt;Conditionning embedding&lt;/h1&gt;
&lt;p&gt;The diffusion model is conditioned on two things: the noise timestep and the class label. It is important to turn these into the right shapes before feeding them into the DiT block later on to condition the block output. The noise timestep is of shape (B,), one timestep per batch item, and the label is also (B,), one label per batch item. It is important to note that in this specific paper, the DiT is trained on ImageNet, which has categorical labels instead of the free-text strings used by popular text-to-image diffusion models. The timesteps and labels are embedded differently; the authors state that the embedding methods are taken from ADM (ADM ref).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Labels: The labels are embedded with a learned embedding table. It is implemented in a manner that allows for Classifier-Free Guidance: during training, some labels are randomly dropped and replaced with a generic "null" label. By sometimes stripping the label, the model also learns an unconditional prediction, which is what enables classifier-free guidance at sampling time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Timesteps: (Apologies in advance for my attempt at explaining this in English :( ) Timesteps can be fractional, so they cannot simply index into a table like categorical data; a variant of the sine-cosine embedding is used instead. The key differences from the usual positional embedding are that the timestep multiplies the arguments of the sines and cosines, and that the sines and cosines are not interleaved one after another but concatenated along the hidden dimension. Also, this sine-cosine layer operates at a different dimension than the embedding dimension of the model, so a projection layer is applied at the end to return to the model dimension. In a sense, it is a combination of sin-cos embeddings and learned embeddings.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
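&lt;p&gt;The two embedders above can be sketched roughly like this (class names, defaults, and the drop probability are my own assumptions, not taken from the paper's code):&lt;/p&gt;

```python
import math
import torch
import torch.nn as nn

class TimestepEmbedder(nn.Module):
    """Sinusoidal timestep features followed by a learned MLP projection (ADM-style sketch)."""
    def __init__(self, dim=1152, freq_dim=256):
        super().__init__()
        self.freq_dim = freq_dim
        self.mlp = nn.Sequential(nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t):
        # t: (B,) timesteps, possibly fractional
        half = self.freq_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
        args = t[:, None].float() * freqs[None]  # the timestep multiplies the angles
        # cosines and sines concatenated along the feature dim, not interleaved
        feats = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        return self.mlp(feats)

class LabelEmbedder(nn.Module):
    """Learned label table with an extra "null" row for classifier-free guidance."""
    def __init__(self, num_classes=1000, dim=1152, drop_prob=0.1):
        super().__init__()
        self.table = nn.Embedding(num_classes + 1, dim)  # index num_classes = null label
        self.num_classes = num_classes
        self.drop_prob = drop_prob

    def forward(self, y):
        # y: (B,) integer class labels
        if self.training and self.drop_prob > 0:
            # randomly replace some labels with the null index during training
            drop = torch.bernoulli(torch.full(y.shape, self.drop_prob)).bool()
            y = torch.where(drop, torch.full_like(y, self.num_classes), y)
        return self.table(y)

# the two embeddings are summed into a single condition vector
c = TimestepEmbedder()(torch.tensor([0.5, 3.0])) + LabelEmbedder()(torch.tensor([7, 42]))
```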
&lt;h1&gt;DiT Block&lt;/h1&gt;
&lt;p&gt;The DiT block is architecturally identical to the standard Transformer block, with the exception of the conditioning mechanism. I will not dive deep into the attention and the feedforward network, as there are a multitude of guides out there that already explain them in depth with competence. Seriously, the transformer can be its own blog post series and I will leave that as an exercise to the reader ;-). On top of the regular transformer block, there are conditioning layers that inject the timestep and label embeddings we previously created, steering the block output to be conditioned on the timestep and label. The technique used is adaptive layer norm (adaLN) with zero initialization (making it adaLN-Zero). The condition embedding is projected into a dimension of 6 * dim and then chunked into six variables: gamma_1, gamma_2, beta_1, beta_2, alpha_1, alpha_2. These variables scale and shift the activations inside the block: a scale-shift (gamma being the scale and beta the shift) is applied before every Multi-Head Attention and FeedForward operation, and another scale with alpha is applied after the self-attention and feedforward operations, gating their residual contributions. The "Zero" in adaLN-Zero comes from zero-initializing the weights of that 6 * dim projection MLP, so each block starts out as an identity function, which stabilizes training. The authors chose adaLN-Zero over cross-attention and simple concatenation of the embeddings because they found it worked best in terms of FID score and compute efficiency.&lt;/p&gt;
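&lt;p&gt;Here is a rough sketch of such a block (my own naming and simplifications; the real block has more engineering details around attention and precision):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero conditioning. A sketch, not the official code."""
    def __init__(self, dim=1152, heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # project the condition into 6 modulation vectors; zero-init is the "Zero"
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[1].weight)
        nn.init.zeros_(self.adaLN[1].bias)

    def forward(self, x, c):
        # x: (B, T, dim) tokens, c: (B, dim) condition (timestep + label embedding)
        b1, g1, a1, b2, g2, a2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + g1[:, None]) + b1[:, None]  # scale-shift before attention
        x = x + a1[:, None] * self.attn(h, h, h)[0]          # alpha gates the residual branch
        h = self.norm2(x) * (1 + g2[:, None]) + b2[:, None]  # scale-shift before feedforward
        return x + a2[:, None] * self.mlp(h)

x, c = torch.randn(2, 256, 1152), torch.randn(2, 1152)
out = DiTBlock()(x, c)  # at initialization the block acts as an identity: out equals x
```

Note how the zero initialization shows up directly: all six modulation vectors start at zero, so every residual branch contributes nothing until training moves the weights.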
&lt;h1&gt;Output Projection and Unpatchify&lt;/h1&gt;
&lt;p&gt;This part is responsible for converting the transformer outputs back into the image shape. A final layer projects from the hidden size to &lt;code&gt;patch_size * patch_size * out_channels&lt;/code&gt;, which the unpatchify step then folds back into &lt;code&gt;(B, C, H, W)&lt;/code&gt;. Since this final output has the same shape as the initial image, the loss can be computed directly against it.&lt;/p&gt;
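&lt;p&gt;A sketch of the final projection plus unpatchify (again, my own class name and defaults, matching the shapes used earlier):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class FinalLayer(nn.Module):
    """Project tokens back to patch pixels and fold them into an image. A sketch."""
    def __init__(self, dim=1152, patch_size=2, out_channels=4):
        super().__init__()
        self.p, self.c = patch_size, out_channels
        self.proj = nn.Linear(dim, patch_size * patch_size * out_channels)

    def forward(self, x, h, w):
        # x: (B, h*w, dim) -> (B, h*w, p*p*C)
        x = self.proj(x)
        B = x.shape[0]
        # unpatchify: view as (B, h, w, p, p, C), then fold to (B, C, h*p, w*p)
        x = x.view(B, h, w, self.p, self.p, self.c)
        x = x.permute(0, 5, 1, 3, 2, 4)
        return x.reshape(B, self.c, h * self.p, w * self.p)

img = FinalLayer()(torch.randn(2, 256, 1152), 16, 16)  # shape (2, 4, 32, 32)
```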
&lt;h1&gt;Putting it all together&lt;/h1&gt;
&lt;p&gt;Chaining all of these components together yields the Diffusion Transformer! It accepts an image/latent and produces a noise prediction (or a velocity field if you're using a flow matching objective).&lt;/p&gt;
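&lt;p&gt;As a rough shape walk-through of the full forward pass (with &lt;code&gt;nn.Identity&lt;/code&gt; stubs standing in for the DiT blocks so the sketch stays self-contained; the real blocks would also take the condition embedding):&lt;/p&gt;

```python
import torch
import torch.nn as nn

# shapes from the examples earlier: 4-channel 32x32 latent, patch size 2, dim 1152
B, C, H, W, p, dim = 2, 4, 32, 32, 2, 1152
patchify = nn.Conv2d(C, dim, p, p)
blocks = nn.ModuleList([nn.Identity() for _ in range(28)])  # stand-ins for DiT blocks
final = nn.Linear(dim, p * p * C)

x = patchify(torch.randn(B, C, H, W)).flatten(2).transpose(1, 2)  # (B, 256, 1152)
for blk in blocks:
    x = blk(x)  # real blocks are called as blk(x, c) with the condition c
x = final(x)    # (B, 256, p*p*C)
# unpatchify back to the latent shape so the loss can be computed
noise_pred = (x.view(B, H // p, W // p, p, p, C)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(B, C, H, W))
```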
&lt;h1&gt;Training &amp;amp; sampling&lt;/h1&gt;
&lt;p&gt;I think the training and especially the sampling deserve their own blog post because of their high-ish complexity (or my lack of clarity!). Stay tuned for DDPM/DDIM vs Flow Matching.&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;I have learned a lot from implementing this architecture from scratch. Embedding the inputs and conditioning were things I had not worked on before, coming from an LLM background. Otherwise, I am surprised at how much this shares with CNN-based diffusion models in terms of training, and with LLMs in terms of architecture (Transformer supremacy).&lt;/p&gt;
&lt;h1&gt;References&lt;/h1&gt;</content><category term="Technical"></category><category term="models"></category><category term="research"></category></entry><entry><title>What I am working on currently</title><link href="https://NLTuan.github.io/what-i-am-working-on-currently.html" rel="alternate"></link><published>2025-09-24T00:00:00-04:00</published><updated>2025-09-24T00:00:00-04:00</updated><author><name>Le Tuan Huy (Tony) Nguyen</name></author><id>tag:nltuan.github.io,2025-09-24:/what-i-am-working-on-currently.html</id><summary type="html">&lt;p&gt;It has been a few months since my first and only blog post, and I'd love to get back into writing and sharing once again. I know that writing does me a lot of good in verbalizing my thoughts clearly. Since it has been so long, this blog is an update to what I have been working on throughout this summer.&lt;/p&gt;</summary><content type="html">&lt;p&gt;It has been a few months since my first and only blog post, and I'd love to get back into writing and sharing once again. I know that writing does me a lot of good in verbalizing my thoughts clearly. Since it has been so long, this blog is an update to what I have been working on throughout this summer.&lt;/p&gt;
&lt;h2&gt;Vision Language Action Models (VLAs) and Robotics&lt;/h2&gt;
&lt;p&gt;I am grateful to have been given the opportunity to work on an ML robotics project, which was something I hadn't anticipated beforehand. The project consists of controlling a robot to perform an action based on a verbal input. This leverages VLAs, models that predict what actions the robot should take. VLAs are composed of two parts: a VLM (Vision Language Model) that processes text and image inputs, and a diffusion model that generates multiple actions. The architecture we are currently using is &lt;a href="https://huggingface.co/blog/smolvla"&gt;SmolVLA&lt;/a&gt;. The &lt;a href="https://github.com/huggingface/lerobot"&gt;lerobot&lt;/a&gt; library is great for most of the tasks we need: recording data, training a policy, and running asynchronous inference. The robotics space is riddled with non-AI problems in hardware and Linux that I had the privilege of never encountering before.&lt;/p&gt;
&lt;h2&gt;LLM finetuning research&lt;/h2&gt;
&lt;p&gt;I am also active in LLM finetuning research with the Local Research Group in the &lt;a href="https://course.fast.ai/"&gt;fast.ai&lt;/a&gt; Discord. This project compares the efficacy of different finetuning techniques (full finetuning, LoRA, rsLoRA, DoRA) on math and coding domains. We are looking for improvements while also retaining base model capabilities. I was tasked with model evaluation; for this, I used &lt;a href="https://github.com/EleutherAI/lm-evaluation-harness"&gt;lm-evaluation-harness&lt;/a&gt; with vLLM support. Other ongoing tasks within the team are custom modeling for efficient training, data decontamination, chat templates, etc.&lt;/p&gt;
&lt;h2&gt;Personal Learning&lt;/h2&gt;
&lt;p&gt;On the side, I am learning RL algorithms alongside Clusters of Stars, a group within the fast.ai Discord. I have learned and implemented DQN (Deep Q-Networks), Policy Gradient, and A2C (Advantage Actor-Critic) from scratch with PyTorch: &lt;a href="https://github.com/NLTuan/rl_imps"&gt;RL Implementations&lt;/a&gt;. This was to build the RL foundation I had largely disregarded due to the scary math. Given enough time thinking and coding, the scary notation becomes slightly less scary as my intuition develops (still super scary though!). This RL knowledge is crucial for understanding how modern RL-based post-training (PPO, DPO, GRPO) works.&lt;/p&gt;
&lt;h2&gt;Looking Forward&lt;/h2&gt;
&lt;p&gt;These experiences have been incredibly valuable in expanding my technical skills while also teaching me the importance of persistence when facing complex mathematical concepts. The intersection of robotics, LLMs, and RL continues to fascinate me, and I'm excited to share more detailed technical posts about these projects in the future.&lt;/p&gt;</content><category term="Personal"></category><category term="personal"></category><category term="career"></category><category term="research"></category><category term="update"></category></entry><entry><title>Welcome to My Blog</title><link href="https://NLTuan.github.io/welcome-to-my-blog.html" rel="alternate"></link><published>2025-04-18T00:00:00-04:00</published><updated>2025-04-18T00:00:00-04:00</updated><author><name>Le Tuan Huy Nguyen</name></author><id>tag:nltuan.github.io,2025-04-18:/welcome-to-my-blog.html</id><summary type="html">&lt;p&gt;Hello to whoever is reading this! You're currently on my first ever blog post! I am no big writer and this is my first genuine attempt at creating and publishing posts that are meaningful to me. Writing has always been intimidating to me, but this will serve as a starting point towards being more comfortable putting my ideas into words and sharing them online. The main reason why I have decided to start blogging seriously is that I want to value creation over passive consumption and to keep track my learning and ideas.&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Welcome to My Blog&lt;/h1&gt;
&lt;p&gt;Hello to whoever is reading this! You're currently on my first ever blog post!&lt;/p&gt;
&lt;p&gt;I am no big writer and this is my first genuine attempt at creating and publishing posts that are meaningful to me. 
Writing has always been intimidating to me. I've always hated it, but this will serve as a starting point towards 
being more comfortable putting my ideas into words and sharing them online to super duper cool readers like you.&lt;/p&gt;
&lt;p&gt;The main reason why I have decided to start blogging seriously is that I want to value creation over passive consumption
and to keep track of my learning and ideas.&lt;/p&gt;
&lt;h2&gt;Who am I?&lt;/h2&gt;
&lt;p&gt;Currently, I am a CS student entering McGill University in Montreal, Canada. I'm interested in AI, specifically anything 
related to LLMs whether that be research or engineering. I love to train models! I try my best, but it is a daunting task
with many pitfalls. I am also a RAG enthusiast; I'd love to learn more about it and develop apps or systems.&lt;/p&gt;
&lt;h2&gt;What to expect&lt;/h2&gt;
&lt;p&gt;There is nothing set in stone for the near future as of writing this. Although one promise that I'll try to uphold is
sharing technical things like my learning journey in AI and software development or my random thoughts on random things.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;You've reached the end of my first ever blog post. This was just raw words from me. One day I will come back to this and 
cringe at it probably, but it is the risk that I am willing to take if the reward is becoming a better communicator.&lt;/p&gt;
&lt;p&gt;Thanks for reading :)&lt;/p&gt;</content><category term="Personal"></category><category term="welcome"></category><category term="introduction"></category><category term="blogging"></category></entry></feed>