#architecture

4 posts · all tags

Aug 4, 2025 · Mixture of Experts

In the pursuit of scaling neural networks to unprecedented parameter counts while maintaining computational tractability, the paradigm of conditional computation has emerged as a cornerstone of modern deep learning architectures. A prominent and highly successful incarnation of this principle is the Mixture of Experts (MoE) layer. At its core, an MoE model eschews the monolithic, dense activation of traditional networks, wherein every parameter is engaged for every input. Instead, it employs a collection of specialized subnetworks, termed experts, and dynamically selects a sparse combination of these experts to process each input token. This approach allows for a dramatic increase in model capacity without a commensurate rise in computational cost (FLOPs), as only a fraction of the network's parameters are utilized for any given forward pass.
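
As a rough illustration of the sparse selection described above, here is a minimal sketch of top-k routing in plain Python. It is not code from the post: the names `moe_forward`, `make_expert`, and `gate_weights` are hypothetical, the "experts" are simple scaling functions standing in for real subnetworks, and the router is assumed to be a single linear layer followed by a softmax over the selected experts.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    # Hypothetical expert: a stand-in for a small feed-forward subnetwork.
    return lambda x: [scale * v for v in x]


def moe_forward(x, gate_weights, experts, k=2):
    # Router logits: one score per expert (dot product of a gate row with x).
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    # Keep only the k highest-scoring experts; the rest are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize the gate over the selected experts only.
    probs = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)
        out = [o + p * yv for o, yv in zip(out, y)]
    return out, top


random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(s) for s in range(1, num_experts + 1)]
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

x = [0.5, -1.0, 2.0, 0.1]
y, chosen = moe_forward(x, gate_weights, experts, k=2)
print("experts used:", chosen)
print("output:", y)
```

With `k=2` and eight experts, only two expert functions run per input, which is the sense in which capacity can grow without a matching rise in per-token compute. Real MoE layers typically add load-balancing terms and batched dispatch, which this sketch leaves out.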

Sep 21, 2024 · Kolmogorov-Arnold Networks

I have been playing with some implementations of Kolmogorov-Arnold Networks. Here is an easy implementation for anyone who wants to try it out.

Jul 21, 2024 · Encoder vs Decoder vs EncoderDecoder Architectures

Language models are a crucial component in natural language processing (NLP). The architecture of these models can be broadly categorized into three types: encoder-only, decoder-only, and encoder-decoder architectures. Each of these architectures has distinct characteristics and applications.

Jan 8, 2024 · Implementing the Transformer in Python

Hello everyone. Today I will present a sketch of a transformer implementation. The focus here will be only on the forward pass of the architecture and not on learning the weights.
