torch.nn
These are the basic building blocks for graphs:
 Containers
 Convolution Layers
 Pooling layers
 Padding Layers
 Nonlinear Activations (weighted sum, nonlinearity)
 Nonlinear Activations (other)
 Normalization Layers
 Recurrent Layers
 Transformer Layers
 Linear Layers
 Dropout Layers
 Sparse Layers
 Distance Functions
 Loss Functions
 Vision Layers
 Shuffle Layers
 DataParallel Layers (multi-GPU, distributed)
 Utilities
 Quantized Functions
 Lazy Modules Initialization
Parameter 
A kind of Tensor that is to be considered a module parameter. 
UninitializedParameter 
A parameter that is not initialized. 
UninitializedBuffer 
A buffer that is not initialized. 
Containers
Module 
Base class for all neural network modules. 
Sequential 
A sequential container. 
ModuleList 
Holds submodules in a list. 
ModuleDict 
Holds submodules in a dictionary. 
ParameterList 
Holds parameters in a list. 
ParameterDict 
Holds parameters in a dictionary. 
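The containers above can be combined in a single module. The following is a minimal sketch, assuming PyTorch is installed; the class name `TinyNet` and all layer sizes are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical module combining several container types."""

    def __init__(self):
        super().__init__()
        # Sequential chains submodules in order.
        self.body = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
        # ModuleList registers each head so its parameters are tracked.
        self.heads = nn.ModuleList([nn.Linear(8, 2) for _ in range(3)])
        # ParameterList registers bare parameters the same way.
        self.scales = nn.ParameterList(
            [nn.Parameter(torch.ones(1)) for _ in range(3)]
        )

    def forward(self, x):
        h = self.body(x)
        return [s * head(h) for s, head in zip(self.scales, self.heads)]


net = TinyNet()
outputs = net(torch.randn(5, 4))
num_params = sum(p.numel() for p in net.parameters())
```

Because the submodules and parameters are held in registered containers rather than plain Python lists, `net.parameters()` sees all of them.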
Global Hooks For Module
register_module_forward_pre_hook 
Registers a forward pre-hook common to all modules.
register_module_forward_hook
Registers a global forward hook for all the modules.
register_module_backward_hook
Registers a backward hook common to all the modules.
register_module_full_backward_pre_hook
Registers a backward pre-hook common to all the modules.
register_module_full_backward_hook 
Registers a backward hook common to all the modules. 
register_module_buffer_registration_hook 
Registers a buffer registration hook common to all modules. 
register_module_module_registration_hook 
Registers a module registration hook common to all modules. 
register_module_parameter_registration_hook 
Registers a parameter registration hook common to all modules. 
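As a rough illustration of a global hook, the sketch below registers a forward hook that fires for every module and then removes it again. It assumes the function is imported from `torch.nn.modules.module`; the hook name `log_output_shape` and the list `seen` are hypothetical.

```python
import torch
from torch import nn
from torch.nn.modules.module import register_module_forward_hook

seen = []


def log_output_shape(module, inputs, output):
    # Called after forward() of every module while the hook is installed.
    seen.append((type(module).__name__, tuple(output.shape)))


handle = register_module_forward_hook(log_output_shape)
try:
    model = nn.Sequential(nn.Linear(3, 4), nn.ReLU())
    model(torch.zeros(2, 3))
finally:
    # Global hooks affect all modules, so remove them promptly.
    handle.remove()
```

Global hooks are mainly useful for debugging and profiling; the returned handle should be removed once it is no longer needed.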
Convolution Layers
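nn.Conv1d
Applies a 1D convolution over an input signal composed of several input planes.
nn.Conv2d
Applies a 2D convolution over an input signal composed of several input planes.
nn.Conv3d
Applies a 3D convolution over an input signal composed of several input planes.
nn.ConvTranspose1d
Applies a 1D transposed convolution operator over an input image composed of several input planes.
nn.ConvTranspose2d
Applies a 2D transposed convolution operator over an input image composed of several input planes.
nn.ConvTranspose3d
Applies a 3D transposed convolution operator over an input image composed of several input planes.
nn.LazyConv1d
A torch.nn.Conv1d module with lazy initialization of the in_channels argument.
nn.LazyConv2d
A torch.nn.Conv2d module with lazy initialization of the in_channels argument.
nn.LazyConv3d
A torch.nn.Conv3d module with lazy initialization of the in_channels argument.
nn.LazyConvTranspose1d
A torch.nn.ConvTranspose1d module with lazy initialization of the in_channels argument.
nn.LazyConvTranspose2d
A torch.nn.ConvTranspose2d module with lazy initialization of the in_channels argument.
nn.LazyConvTranspose3d
A torch.nn.ConvTranspose3d module with lazy initialization of the in_channels argument.
nn.Unfold
Extracts sliding local blocks from a batched input tensor.
nn.Fold
Combines an array of sliding local blocks into a large containing tensor.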
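A short sketch of how the convolution layers change tensor shapes, and of Unfold and Fold acting as inverses when the blocks do not overlap. The channel counts and kernel sizes are arbitrary choices for illustration.

```python
import torch
from torch import nn

x = torch.randn(1, 3, 8, 8)  # (N, C, H, W)

# padding=1 with a 3x3 kernel keeps the spatial size unchanged.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, padding=1)
y = conv(x)

# A stride-2 transposed convolution doubles the spatial size here.
up = nn.ConvTranspose2d(in_channels=6, out_channels=3, kernel_size=2, stride=2)
z = up(y)

# Non-overlapping 2x2 blocks: Fold exactly reverses Unfold.
unfold = nn.Unfold(kernel_size=2, stride=2)
fold = nn.Fold(output_size=(8, 8), kernel_size=2, stride=2)
blocks = unfold(x)        # (N, C * 2 * 2, number_of_blocks)
restored = fold(blocks)
```

With overlapping blocks, Fold sums the overlapping values, so the round trip would no longer return the input unchanged.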
Pooling layers
nn.MaxPool1d
Applies a 1D max pooling over an input signal composed of several input planes.
nn.MaxPool2d
Applies a 2D max pooling over an input signal composed of several input planes.
nn.MaxPool3d
Applies a 3D max pooling over an input signal composed of several input planes.
nn.MaxUnpool1d
Computes a partial inverse of MaxPool1d.
nn.MaxUnpool2d
Computes a partial inverse of MaxPool2d.
nn.MaxUnpool3d
Computes a partial inverse of MaxPool3d.
nn.AvgPool1d
Applies a 1D average pooling over an input signal composed of several input planes.
nn.AvgPool2d
Applies a 2D average pooling over an input signal composed of several input planes.
nn.AvgPool3d
Applies a 3D average pooling over an input signal composed of several input planes.
nn.FractionalMaxPool2d
Applies a 2D fractional max pooling over an input signal composed of several input planes.
nn.FractionalMaxPool3d
Applies a 3D fractional max pooling over an input signal composed of several input planes.
nn.LPPool1d
Applies a 1D power-average pooling over an input signal composed of several input planes.
nn.LPPool2d
Applies a 2D power-average pooling over an input signal composed of several input planes.
nn.AdaptiveMaxPool1d
Applies a 1D adaptive max pooling over an input signal composed of several input planes.
nn.AdaptiveMaxPool2d
Applies a 2D adaptive max pooling over an input signal composed of several input planes.
nn.AdaptiveMaxPool3d
Applies a 3D adaptive max pooling over an input signal composed of several input planes.
nn.AdaptiveAvgPool1d
Applies a 1D adaptive average pooling over an input signal composed of several input planes.
nn.AdaptiveAvgPool2d
Applies a 2D adaptive average pooling over an input signal composed of several input planes.
nn.AdaptiveAvgPool3d
Applies a 3D adaptive average pooling over an input signal composed of several input planes.
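The "partial inverse" wording can be made concrete with a small sketch: max pooling keeps one value per window, and unpooling puts each kept value back at its original position while every other position becomes zero. The 4x4 input is an arbitrary example.

```python
import torch
from torch import nn

x = torch.arange(16.0).reshape(1, 1, 4, 4)

# return_indices=True also returns where each maximum came from.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

pooled, indices = pool(x)
recovered = unpool(pooled, indices)  # non-maximal positions are zero

# Adaptive pooling is specified by output size instead of kernel size.
global_avg = nn.AdaptiveAvgPool2d(output_size=1)(x)
```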
Padding Layers
nn.ReflectionPad1d
Pads the input tensor using the reflection of the input boundary.
nn.ReflectionPad2d
Pads the input tensor using the reflection of the input boundary.
nn.ReflectionPad3d
Pads the input tensor using the reflection of the input boundary.
nn.ReplicationPad1d
Pads the input tensor using replication of the input boundary.
nn.ReplicationPad2d
Pads the input tensor using replication of the input boundary.
nn.ReplicationPad3d
Pads the input tensor using replication of the input boundary.
nn.ZeroPad1d
Pads the input tensor boundaries with zero.
nn.ZeroPad2d
Pads the input tensor boundaries with zero.
nn.ZeroPad3d
Pads the input tensor boundaries with zero.
nn.ConstantPad1d
Pads the input tensor boundaries with a constant value.
nn.ConstantPad2d
Pads the input tensor boundaries with a constant value.
nn.ConstantPad3d
Pads the input tensor boundaries with a constant value.
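The padding modes differ only in which values fill the border. A minimal sketch on a three-element signal, with an arbitrary padding width of 2 and fill value 9.0:

```python
import torch
from torch import nn

x = torch.tensor([[[1.0, 2.0, 3.0]]])  # (N, C, W)

reflect = nn.ReflectionPad1d(2)(x)      # mirrors around the edge value
replicate = nn.ReplicationPad1d(2)(x)   # repeats the edge value
constant = nn.ConstantPad1d(2, 9.0)(x)  # fills with the given constant

# ZeroPad2d pads both spatial dimensions of a 2D feature map.
zero = nn.ZeroPad2d(1)(torch.ones(1, 1, 2, 2))
```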
Nonlinear Activations (weighted sum, nonlinearity)
nn.ELU
Applies the Exponential Linear Unit (ELU) function, element-wise, as described in the paper: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).
nn.Hardshrink
Applies the Hard Shrinkage (Hardshrink) function element-wise.
nn.Hardsigmoid
Applies the Hardsigmoid function element-wise.
nn.Hardtanh
Applies the HardTanh function element-wise.
nn.Hardswish
Applies the Hardswish function, element-wise, as described in the paper: Searching for MobileNetV3.
nn.LeakyReLU
Applies the LeakyReLU function element-wise.
nn.LogSigmoid
Applies the LogSigmoid function element-wise.
nn.MultiheadAttention
Allows the model to jointly attend to information from different representation subspaces as described in the paper: Attention Is All You Need.
nn.PReLU
Applies the PReLU function element-wise.
nn.ReLU
Applies the rectified linear unit function element-wise.
nn.ReLU6
Applies the ReLU6 function element-wise.
nn.RReLU
Applies the randomized leaky rectified linear unit function, element-wise, as described in the paper: Empirical Evaluation of Rectified Activations in Convolutional Network.
nn.SELU
Applies the SELU function element-wise.
nn.CELU
Applies the CELU function element-wise.
nn.GELU
Applies the Gaussian Error Linear Units function.
nn.Sigmoid
Applies the Sigmoid function element-wise.
nn.SiLU
Applies the Sigmoid Linear Unit (SiLU) function, element-wise.
nn.Mish
Applies the Mish function, element-wise.
nn.Softplus
Applies the Softplus function $\text{Softplus}(x) = \frac{1}{\beta} * \log(1 + \exp(\beta * x))$ element-wise.
nn.Softshrink
Applies the soft shrinkage function element-wise.
nn.Softsign
Applies the Softsign function element-wise.
nn.Tanh
Applies the Hyperbolic Tangent (Tanh) function element-wise.
nn.Tanhshrink
Applies the Tanhshrink function element-wise.
nn.Threshold
Thresholds each element of the input Tensor.
nn.GLU
Applies the gated linear unit function $\text{GLU}(a, b) = a \otimes \sigma(b)$ where $a$ is the first half of the input matrices and $b$ is the second half.
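The Softplus and GLU formulas quoted above can be checked numerically against hand-written versions. This is a sketch with arbitrary inputs; `beta = 2.0` is an illustrative choice, and the inputs stay below the module's default linear-switch threshold so the formula applies directly.

```python
import torch
from torch import nn

x = torch.linspace(-2.0, 2.0, steps=5)  # [-2, -1, 0, 1, 2]

relu = nn.ReLU()(x)

# Softplus(x) = (1 / beta) * log(1 + exp(beta * x))
beta = 2.0
softplus = nn.Softplus(beta=beta)(x)
softplus_manual = torch.log1p(torch.exp(beta * x)) / beta

# GLU splits the chosen dimension in half: a * sigmoid(b)
pair = torch.randn(3, 8)
glu = nn.GLU(dim=-1)(pair)
a, b = pair.chunk(2, dim=-1)
glu_manual = a * torch.sigmoid(b)
```

Note that GLU halves the size of the dimension it is applied to.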
Nonlinear Activations (other)
nn.Softmin
Applies the Softmin function to an n-dimensional input Tensor rescaling them so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1.
nn.Softmax
Applies the Softmax function to an n-dimensional input Tensor rescaling them so that the elements of the n-dimensional output Tensor lie in the range [0, 1] and sum to 1.
nn.Softmax2d
Applies SoftMax over features to each spatial location.
nn.LogSoftmax
Applies the $\log(\text{Softmax}(x))$ function to an n-dimensional input Tensor.
nn.AdaptiveLogSoftmaxWithLoss
Efficient softmax approximation as described in Efficient softmax approximation for GPUs by Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou.
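A brief sketch of how these modules relate to each other, using arbitrary logits: Softmax rows sum to 1, LogSoftmax is the logarithm of Softmax, and Softmin is Softmax applied to the negated input.

```python
import torch
from torch import nn

logits = torch.tensor([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])

probs = nn.Softmax(dim=1)(logits)         # each row sums to 1
log_probs = nn.LogSoftmax(dim=1)(logits)  # numerically stable log(softmax)
softmin = nn.Softmin(dim=1)(logits)       # softmax of the negated logits
```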
Normalization Layers
nn.BatchNorm1d
Applies Batch Normalization over a 2D or 3D input as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
nn.BatchNorm2d
Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
nn.BatchNorm3d
Applies Batch Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
nn.LazyBatchNorm1d
A torch.nn.BatchNorm1d module with lazy initialization of the num_features argument.
nn.LazyBatchNorm2d
A torch.nn.BatchNorm2d module with lazy initialization of the num_features argument.
nn.LazyBatchNorm3d
A torch.nn.BatchNorm3d module with lazy initialization of the num_features argument.
nn.GroupNorm
Applies Group Normalization over a mini-batch of inputs as described in the paper Group Normalization.
nn.SyncBatchNorm
Applies Batch Normalization over an N-dimensional input (a mini-batch of [N-2]D inputs with additional channel dimension) as described in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
nn.InstanceNorm1d
Applies Instance Normalization over a 2D (unbatched) or 3D (batched) input as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
nn.InstanceNorm2d
Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
nn.InstanceNorm3d
Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
nn.LazyInstanceNorm1d
A torch.nn.InstanceNorm1d module with lazy initialization of the num_features argument.
nn.LazyInstanceNorm2d
A torch.nn.InstanceNorm2d module with lazy initialization of the num_features argument.
nn.LazyInstanceNorm3d
A torch.nn.InstanceNorm3d module with lazy initialization of the num_features argument.
nn.LayerNorm
Applies Layer Normalization over a mini-batch of inputs as described in the paper Layer Normalization.
nn.LocalResponseNorm
Applies local response normalization over an input signal composed of several input planes, where channels occupy the second dimension.
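The main difference between the normalization layers is which dimensions the statistics are computed over. The sketch below, with arbitrary shapes and an artificially shifted input, contrasts BatchNorm1d (per channel, across the batch) with LayerNorm (per sample, across the last dimension).

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(16, 4, 10) * 3 + 5  # (N, C, L), shifted and scaled

# In training mode (the default) BatchNorm uses batch statistics.
bn = nn.BatchNorm1d(num_features=4)
y = bn(x)
per_channel_mean = y.mean(dim=(0, 2))

# LayerNorm normalizes each sample over the trailing dimension(s).
ln = nn.LayerNorm(normalized_shape=10)
z = ln(x)
per_row_mean = z.mean(dim=-1)
```

In evaluation mode BatchNorm switches to its running statistics, so the per-channel mean of the output is no longer guaranteed to be zero.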
Recurrent Layers
nn.RNNBase
Base class for RNN modules (RNN, LSTM, GRU).
nn.RNN
Applies a multi-layer Elman RNN with $\tanh$ or $\text{ReLU}$ non-linearity to an input sequence.
nn.LSTM
Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
nn.GRU
Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
nn.RNNCell
An Elman RNN cell with tanh or ReLU non-linearity.
nn.LSTMCell
A long short-term memory (LSTM) cell.
nn.GRUCell
A gated recurrent unit (GRU) cell.
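The full-sequence modules and the cell modules differ in who drives the time loop. A minimal sketch with arbitrary sizes: `nn.LSTM` consumes the whole sequence, while `nn.LSTMCell` is stepped manually.

```python
import torch
from torch import nn

x = torch.randn(3, 4, 5)  # (batch, seq_len, features) with batch_first=True

lstm = nn.LSTM(input_size=5, hidden_size=7, num_layers=2, batch_first=True)
output, (h_n, c_n) = lstm(x)
# output: last layer's hidden state at every time step
# h_n, c_n: final hidden and cell state of every layer

# The cell variant processes one time step per call.
cell = nn.LSTMCell(input_size=5, hidden_size=7)
h = torch.zeros(3, 7)
c = torch.zeros(3, 7)
for t in range(x.size(1)):
    h, c = cell(x[:, t], (h, c))
```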
Transformer Layers
nn.Transformer
A transformer model.
nn.TransformerEncoder
TransformerEncoder is a stack of N encoder layers.
nn.TransformerDecoder
TransformerDecoder is a stack of N decoder layers.
nn.TransformerEncoderLayer
TransformerEncoderLayer is made up of self-attn and feedforward network.
nn.TransformerDecoderLayer
TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
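A small sketch of how an encoder stack is assembled from a single layer definition. The model width, head count, and layer count are arbitrary illustrative values.

```python
import torch
from torch import nn

# One encoder layer: self-attention followed by a feedforward network.
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=4, dim_feedforward=32, batch_first=True
)
# The encoder clones the layer num_layers times.
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(2, 5, 16)  # (batch, seq_len, d_model)
encoded = encoder(tokens)
```

The output keeps the input shape; `d_model` must be divisible by `nhead`.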
Linear Layers
nn.Identity
A placeholder identity operator that is argument-insensitive.
nn.Linear
Applies a linear transformation to the incoming data: $y = xA^T + b$
nn.Bilinear
Applies a bilinear transformation to the incoming data: $y = x_1^T A x_2 + b$
nn.LazyLinear
A torch.nn.Linear module where in_features is inferred.
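The two formulas above can be verified directly against the modules' stored weights. This is a sketch with arbitrary sizes; the einsum expression is one way to write the bilinear form per output feature.

```python
import torch
from torch import nn

x1 = torch.randn(4, 3)
x2 = torch.randn(4, 5)

# Linear: y = x A^T + b, with A stored as linear.weight (out, in).
linear = nn.Linear(3, 2)
y = linear(x1)
y_manual = x1 @ linear.weight.T + linear.bias

# Bilinear: y_o = x1^T A_o x2 + b_o, with A stored as (out, in1, in2).
bilinear = nn.Bilinear(3, 5, 2)
yb = bilinear(x1, x2)
yb_manual = torch.einsum("bi,oij,bj->bo", x1, bilinear.weight, x2) + bilinear.bias

# Identity ignores its constructor arguments and returns the input.
identity = nn.Identity(54, unused="ignored")
same = identity(x1)
```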
Dropout Layers
nn.Dropout
During training, randomly zeroes some of the elements of the input tensor with probability p.
nn.Dropout1d
Randomly zero out entire channels (a channel is a 1D feature map, e.g., the $j$-th channel of the $i$-th sample in the batched input is a 1D tensor $\text{input}[i, j]$).
nn.Dropout2d
Randomly zero out entire channels (a channel is a 2D feature map, e.g., the $j$-th channel of the $i$-th sample in the batched input is a 2D tensor $\text{input}[i, j]$).
nn.Dropout3d
Randomly zero out entire channels (a channel is a 3D feature map, e.g., the $j$-th channel of the $i$-th sample in the batched input is a 3D tensor $\text{input}[i, j]$).
nn.AlphaDropout
Applies Alpha Dropout over the input.
nn.FeatureAlphaDropout
Randomly masks out entire channels (a channel is a feature map).
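Dropout behaves differently in training and evaluation mode, and the channel-wise variants drop whole feature maps rather than single elements. A sketch with `p=0.5` and arbitrary shapes:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.ones(1000)

drop = nn.Dropout(p=0.5)
train_out = drop(x)  # survivors are scaled by 1 / (1 - p) = 2
drop.eval()
eval_out = drop(x)   # identity in evaluation mode

# Dropout2d zeroes entire channels of a (N, C, H, W) input.
channel_drop = nn.Dropout2d(p=0.5)
maps = channel_drop(torch.ones(2, 3, 4, 4))
per_channel = maps.flatten(2)
whole_channels = bool(
    (per_channel.min(dim=2).values == per_channel.max(dim=2).values).all()
)
```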
Sparse Layers
nn.Embedding
A simple lookup table that stores embeddings of a fixed dictionary and size.
nn.EmbeddingBag
Computes sums or means of 'bags' of embeddings, without instantiating the intermediate embeddings.
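To illustrate the relationship between the two layers, the sketch below builds an EmbeddingBag that shares an Embedding's weights and checks that it matches a lookup followed by a mean. Table size and indices are arbitrary.

```python
import torch
from torch import nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=3)
idx = torch.tensor([[1, 2, 4], [4, 3, 9]])  # two bags of three indices

vectors = emb(idx)  # (2, 3, 3): one vector per index

# With a 2D index tensor, each row is treated as one bag.
bag = nn.EmbeddingBag.from_pretrained(emb.weight.detach(), mode="mean")
pooled = bag(idx)   # (2, 3): one pooled vector per bag
```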
Distance Functions
nn.CosineSimilarity
Returns cosine similarity between $x_1$ and $x_2$, computed along dim.
nn.PairwiseDistance
Computes the pairwise distance between input vectors, or between columns of input matrices.
Loss Functions
nn.L1Loss
Creates a criterion that measures the mean absolute error (MAE) between each element in the input $x$ and target $y$.
nn.MSELoss
Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input $x$ and target $y$.
nn.CrossEntropyLoss
This criterion computes the cross entropy loss between input logits and target.
nn.CTCLoss
The Connectionist Temporal Classification loss.
nn.NLLLoss
The negative log likelihood loss.
nn.PoissonNLLLoss
Negative log likelihood loss with Poisson distribution of target.
nn.GaussianNLLLoss
Gaussian negative log likelihood loss.
nn.KLDivLoss
The Kullback-Leibler divergence loss.
nn.BCELoss
Creates a criterion that measures the Binary Cross Entropy between the target and the input probabilities.
nn.BCEWithLogitsLoss
This loss combines a Sigmoid layer and the BCELoss in one single class.
nn.MarginRankingLoss
Creates a criterion that measures the loss given inputs $x1$, $x2$, two 1D mini-batch or 0D Tensors, and a label 1D mini-batch or 0D Tensor $y$ (containing 1 or -1).
nn.HingeEmbeddingLoss
Measures the loss given an input tensor $x$ and a labels tensor $y$ (containing 1 or -1).
nn.MultiLabelMarginLoss
Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 2D Tensor of target class indices).
nn.HuberLoss
Creates a criterion that uses a squared term if the absolute element-wise error falls below delta and a delta-scaled L1 term otherwise.
nn.SmoothL1Loss
Creates a criterion that uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise.
nn.SoftMarginLoss
Creates a criterion that optimizes a two-class classification logistic loss between input tensor $x$ and target tensor $y$ (containing 1 or -1).
nn.MultiLabelSoftMarginLoss
Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input $x$ and target $y$ of size $(N, C)$.
nn.CosineEmbeddingLoss
Creates a criterion that measures the loss given input tensors $x_1$, $x_2$ and a Tensor label $y$ with values 1 or -1.
nn.MultiMarginLoss
Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 1D tensor of target class indices).
nn.TripletMarginLoss
Creates a criterion that measures the triplet loss given input tensors $x1$, $x2$, $x3$ and a margin with a value greater than $0$.
nn.TripletMarginWithDistanceLoss
Creates a criterion that measures the triplet loss given input tensors $a$, $p$, and $n$ (representing anchor, positive, and negative examples, respectively), and a nonnegative, real-valued function ("distance function") used to compute the relationship between the anchor and positive example ("positive distance") and the anchor and negative example ("negative distance").
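Several of these criteria are compositions of simpler ones. The sketch below, on arbitrary data, checks two such identities numerically (cross entropy as LogSoftmax plus NLL, and BCE-with-logits as Sigmoid plus BCE) and works the MAE and MSE definitions by hand.

```python
import torch
from torch import nn

torch.manual_seed(0)

# CrossEntropyLoss on logits equals NLLLoss on log-probabilities.
logits = torch.randn(4, 3)
target = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# BCEWithLogitsLoss equals BCELoss applied after a sigmoid.
raw = torch.randn(5)
labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
bce_logits = nn.BCEWithLogitsLoss()(raw, labels)
bce = nn.BCELoss()(torch.sigmoid(raw), labels)

# Errors are 0, 2 and 3, so MAE = 5/3 and MSE = 13/3.
pred = torch.tensor([1.0, 2.0, 4.0])
truth = torch.tensor([1.0, 0.0, 1.0])
mae = nn.L1Loss()(pred, truth)
mse = nn.MSELoss()(pred, truth)
```

The fused variants are generally preferred in practice because they are more numerically stable than applying the two steps separately.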
Vision Layers
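nn.PixelShuffle
Rearranges elements in a tensor of shape $(*, C \times r^2, H, W)$ to a tensor of shape $(*, C, H \times r, W \times r)$, where r is an upscale factor.
nn.PixelUnshuffle
Reverses the PixelShuffle operation by rearranging elements in a tensor of shape $(*, C, H \times r, W \times r)$ to a tensor of shape $(*, C \times r^2, H, W)$, where r is a downscale factor.
nn.Upsample
Upsamples a given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data.
nn.UpsamplingNearest2d
Applies a 2D nearest neighbor upsampling to an input signal composed of several input channels.
nn.UpsamplingBilinear2d
Applies a 2D bilinear upsampling to an input signal composed of several input channels.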
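The shape bookkeeping for pixel shuffling is easiest to see on a concrete tensor. In this sketch the upscale factor is r = 2 and the input has C × r² = 8 channels, so C = 2; both values are arbitrary.

```python
import torch
from torch import nn

x = torch.randn(1, 8, 3, 3)  # (N, C * r^2, H, W) with C = 2, r = 2

shuffled = nn.PixelShuffle(upscale_factor=2)(x)           # (1, 2, 6, 6)
back = nn.PixelUnshuffle(downscale_factor=2)(shuffled)    # (1, 8, 3, 3)

# Upsample interpolates instead of rearranging channels.
up = nn.Upsample(scale_factor=2, mode="nearest")(x)       # (1, 8, 6, 6)
```

PixelShuffle only rearranges elements, so PixelUnshuffle recovers the input exactly.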
Shuffle Layers
nn.ChannelShuffle
Divide the channels in a tensor of shape $(*, C, H, W)$ into g groups and rearrange them as $(*, \frac{C}{g}, g, H, W)$, while keeping the original tensor shape.
DataParallel Layers (multi-GPU, distributed)
nn.DataParallel
Implements data parallelism at the module level.
nn.parallel.DistributedDataParallel
Implements distributed data parallelism that is based on torch.distributed at the module level.
Utilities
From the torch.nn.utils module
clip_grad_norm_ 
Clips gradient norm of an iterable of parameters. 
clip_grad_value_ 
Clips gradient of an iterable of parameters at specified value. 
parameters_to_vector 
Convert parameters to one vector.
vector_to_parameters
Convert one vector to the parameters.
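A sketch of the gradient-clipping and flattening utilities on a toy model. The loss is scaled by 100 purely to produce large gradients; `max_norm=1.0` is an arbitrary threshold.

```python
import torch
from torch import nn

model = nn.Linear(3, 2)
loss = model(torch.ones(4, 3)).sum() * 100  # deliberately large gradients
loss.backward()

# Rescales all gradients in place so their joint norm is at most max_norm;
# the return value is the norm measured before clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
clipped_norm = torch.sqrt(
    sum(p.grad.pow(2).sum() for p in model.parameters())
)

# Flatten all parameters into one vector, then write a vector back.
flat = nn.utils.parameters_to_vector(model.parameters())
nn.utils.vector_to_parameters(torch.zeros_like(flat), model.parameters())
```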
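prune.BasePruningMethod
Abstract base class for creation of new pruning techniques.
prune.PruningContainer
Container holding a sequence of pruning methods for iterative pruning.
prune.Identity
Utility pruning method that does not prune any units but generates the pruning parametrization with a mask of ones.
prune.RandomUnstructured
Prune (currently unpruned) units in a tensor at random.
prune.L1Unstructured
Prune (currently unpruned) units in a tensor by zeroing out the ones with the lowest L1-norm.
prune.RandomStructured
Prune entire (currently unpruned) channels in a tensor at random.
prune.LnStructured
Prune entire (currently unpruned) channels in a tensor based on their Ln-norm.
prune.identity
Applies pruning reparametrization to the tensor corresponding to the parameter called name in module without actually pruning any units.
prune.random_unstructured
Prunes tensor corresponding to parameter called name in module by removing the specified amount of (currently unpruned) units selected at random.
prune.l1_unstructured
Prunes tensor corresponding to parameter called name in module by removing the specified amount of (currently unpruned) units with the lowest L1-norm.
prune.random_structured
Prunes tensor corresponding to parameter called name in module by removing the specified amount of (currently unpruned) channels along the specified dim selected at random.
prune.ln_structured
Prunes tensor corresponding to parameter called name in module by removing the specified amount of (currently unpruned) channels along the specified dim with the lowest Ln-norm.
prune.global_unstructured
Globally prunes tensors corresponding to all parameters in parameters by applying the specified pruning_method.
prune.custom_from_mask
Prunes tensor corresponding to parameter called name in module by applying the pre-computed mask in mask.
prune.remove
Removes the pruning reparameterization from a module and the pruning method from the forward hook.
prune.is_pruned
Check whether module is pruned by looking for forward pre-hooks in its modules that inherit from BasePruningMethod.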
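A minimal sketch of the pruning workflow, assuming the functions live in `torch.nn.utils.prune`: prune half of a layer's weights by L1 magnitude, inspect the reparametrization, then make the pruning permanent. The layer size and `amount=0.5` are illustrative.

```python
import torch
from torch import nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(4, 3)  # 12 weights

# Masks the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# While pruned, `weight` is recomputed from `weight_orig` and `weight_mask`.
pruned_before = prune.is_pruned(layer)
has_orig = hasattr(layer, "weight_orig")
zeros = int((layer.weight == 0).sum())

# Bake the mask into a plain `weight` parameter again.
prune.remove(layer, "weight")
pruned_after = prune.is_pruned(layer)
```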

weight_norm 
Applies weight normalization to a parameter in the given module. 
remove_weight_norm 
Removes the weight normalization reparameterization from a module. 
spectral_norm 
Applies spectral normalization to a parameter in the given module. 
remove_spectral_norm 
Removes the spectral normalization reparameterization from a module. 
skip_init 
Given a module class object and args / kwargs, instantiates the module without initializing parameters / buffers. 
Parametrizations implemented using the new parametrization functionality in torch.nn.utils.parametrize.register_parametrization().
parametrizations.orthogonal
Applies an orthogonal or unitary parametrization to a matrix or a batch of matrices.
parametrizations.spectral_norm
Applies spectral normalization to a parameter in the given module.
Utility functions to parametrize Tensors on existing Modules. Note that these functions can be used to parametrize a given Parameter or Buffer given a specific function that maps from an input space to the parametrized space. They are not parameterizations that would transform an object into a parameter. See the Parametrizations tutorial for more information on how to implement your own parametrizations.
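parametrize.register_parametrization
Adds a parametrization to a tensor in a module.
parametrize.remove_parametrizations
Removes the parametrizations on a tensor in a module.
parametrize.cached
Context manager that enables the caching system within parametrizations registered with register_parametrization().
parametrize.is_parametrized
Returns True if module has an active parametrization.
parametrize.ParametrizationList
A sequential container that holds and manages the original parameters or buffers of a parametrized torch.nn.Module.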
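As a sketch of what "a function that maps from an input space to the parametrized space" can look like, the example below constrains a square weight to be symmetric. The `Symmetric` module is a hypothetical user-defined parametrization; the built-in `orthogonal` parametrization is shown alongside it.

```python
import torch
from torch import nn
import torch.nn.utils.parametrize as parametrize


class Symmetric(nn.Module):
    """Hypothetical parametrization: maps any square matrix to a symmetric one."""

    def forward(self, X):
        # Keep the upper triangle and mirror it below the diagonal.
        return X.triu() + X.triu(1).transpose(-1, -2)


layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())
W = layer.weight  # recomputed through Symmetric on access
is_param = parametrize.is_parametrized(layer, "weight")

# Built-in parametrization: the weight is constrained to be orthogonal.
ortho = nn.utils.parametrizations.orthogonal(nn.Linear(3, 3))
Q = ortho.weight
```

The unconstrained tensor is still what the optimizer updates; the constraint is applied each time the parametrized attribute is read.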
Utility functions to call a given Module in a stateless manner.
stateless.functional_call
Performs a functional call on the module by replacing the module parameters and buffers with the provided ones.
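A sketch of a functional call, assuming the function is imported from `torch.nn.utils.stateless` as listed here (newer PyTorch releases may emit a deprecation warning for this import path). The replacement tensors are arbitrary values chosen so the result is easy to check by hand.

```python
import torch
from torch import nn
from torch.nn.utils.stateless import functional_call

model = nn.Linear(2, 1)
x = torch.tensor([[1.0, 2.0]])

# Parameters to use for this call only, keyed by their names in the module.
replacement = {
    "weight": torch.tensor([[10.0, 20.0]]),
    "bias": torch.tensor([1.0]),
}

# Runs model.forward with the replacement tensors: 1*10 + 2*20 + 1 = 51.
out = functional_call(model, replacement, (x,))

# The module's own parameters are left as they were.
still_parameter = isinstance(model.weight, nn.Parameter)
```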
Utility functions in other modules
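nn.utils.rnn.PackedSequence
Holds the data and list of batch_sizes of a packed sequence.
nn.utils.rnn.pack_padded_sequence
Packs a Tensor containing padded sequences of variable length.
nn.utils.rnn.pad_packed_sequence
Pads a packed batch of variable length sequences.
nn.utils.rnn.pad_sequence
Pad a list of variable length Tensors with padding_value.
nn.utils.rnn.pack_sequence
Packs a list of variable length Tensors.
nn.utils.rnn.unpack_sequence
Unpacks PackedSequence into a list of variable length Tensors.
nn.utils.rnn.unpad_sequence
Unpad padded Tensor into a list of variable length Tensors.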
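The padding and packing helpers are usually chained. A small sketch on three sequences of lengths 3, 2, and 1, already sorted by decreasing length:

```python
import torch
from torch.nn.utils.rnn import (
    pack_padded_sequence,
    pad_packed_sequence,
    pad_sequence,
)

seqs = [
    torch.tensor([1.0, 2.0, 3.0]),
    torch.tensor([4.0, 5.0]),
    torch.tensor([6.0]),
]
lengths = torch.tensor([3, 2, 1])

# Pad to a rectangular (batch, max_len) tensor.
padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)

# Pack: data is stored time step by time step, skipping the padding.
packed = pack_padded_sequence(padded, lengths, batch_first=True)

# Unpack back to a padded tensor plus the original lengths.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
```

A `PackedSequence` can be passed directly to the recurrent layers so that padded positions are not processed.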
nn.Flatten
Flattens a contiguous range of dims into a tensor.
nn.Unflatten
Unflattens a tensor dim expanding it to a desired shape.
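A short sketch of the two reshaping modules as a round trip, using an arbitrary 4D tensor. By default Flatten keeps the batch dimension and merges the rest.

```python
import torch
from torch import nn

x = torch.randn(2, 3, 4, 5)

flat = nn.Flatten()(x)                         # (2, 60)
restored = nn.Unflatten(1, (3, 4, 5))(flat)    # (2, 3, 4, 5)
```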
Quantized Functions
Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. PyTorch supports both per-tensor and per-channel asymmetric linear quantization. To learn more about how to use quantized functions in PyTorch, please refer to the Quantization documentation.
Lazy Modules Initialization
nn.modules.lazy.LazyModuleMixin
A mixin for modules that lazily initialize parameters, also known as "lazy modules."
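A sketch of lazy initialization using `nn.LazyLinear`: the input size is not given at construction and is inferred from the first batch. The feature sizes are arbitrary.

```python
import torch
from torch import nn

lazy = nn.LazyLinear(out_features=4)

# Before the first forward pass the weight has no shape yet.
uninitialized = isinstance(lazy.weight, nn.UninitializedParameter)

y = lazy(torch.randn(2, 7))  # in_features is inferred as 7 here

weight_shape = tuple(lazy.weight.shape)
# After materialization the module becomes a regular nn.Linear.
became_linear = type(lazy) is nn.Linear
```

Because parameter shapes are unknown until the first call, a dry run with a representative input is typically needed before inspecting or optimizing the parameters.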
© 2024, PyTorch Contributors
PyTorch has a BSD-style license, as found in the LICENSE file.
https://pytorch.org/docs/2.1/nn.html