
Building a Simple GPT-Style Q&A LLM from Scratch

A good resource to use alongside this notebook is the original GPT paper [1]. This notebook largely relies on that paper for model architectures and implementation.

This article will walk through building a simple GPT-style model from scratch using PyTorch [1,2]. The goal of this article is to train a basic large language model from start to finish in one notebook. We will train an LLM that is small enough to fit on a single GPU during training and inference, so the notebook can be run in popular cloud GPU services (Google Colab, Kaggle, Paperspace, etc.). The computation graph of the model that we will build in this article is as follows:

image-2.png

This architecture resembles the original GPT model, and is quite similar to GPT-2 and GPT-3, with the main difference being that it is smaller (fewer decoder blocks and smaller embedding sizes) [1,3,4]. We will zoom into each step of this diagram throughout this article to discuss the math, code, and intuition behind it.

According to the original GPT paper, there are two main training stages for the early GPT models: pretraining and supervised fine-tuning [1]. Pretraining is a self-supervised learning task, where parts of the input data are omitted and used as target variables. Supervised fine-tuning works like a traditional supervised learning task, with human-annotated labels for the input data.

1: Pretraining

The first stage in building a GPT model is pretraining. Pretraining builds the "base" of an LLM. It allows the model to understand statistical properties of language, grammar, and context.

Pretraining Goal

The goal of pretraining is simple: to have a model that can reliably predict the next token given the previous $k$ tokens in a sequence. The final result of pretraining is a deep learning model that takes in $k$ tokens and produces a discrete probability distribution over what token $k+1$ should be. We want this distribution to place a high value on the correct token and low values on the incorrect ones.

image-2.png

To achieve this, we start off with a large dataset of raw text. This text can be taken from books, blogs, wikis, research papers, and other text sources. After compiling the large dataset of text, we split the dataset into "chunks" of tokens, where each chunk has a fixed number of tokens (512 for GPT, 1024 for GPT-2, 2048 for GPT-3). This chunk size is known as the "context window". A pretrained model will take in that many tokens and output the most likely next token.

What is a Token?

When dealing with LLMs, we use the word "token" to describe the smallest "unit" of text that an LLM can analyze [5]. Tokens can generally be thought of as words, conceptually. When analyzing a sequence of text, an LLM first has to convert the text to tokens. This is similar to a dictionary lookup: each word/token has an integer "index" in the lookup table, and this index is what is actually fed into the network to be analyzed.

image-3.png
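As a toy illustration of the lookup idea (an invented four-word vocabulary, not a real tokenizer; real tokenizers like the BPE tokenizer used later operate on subword units):

# Toy "tokenizer": an illustrative vocabulary, not a real one
toy_vocab = {"the": 0, "sky": 1, "is": 2, "blue": 3}

def toy_encode(text):
    # Look up each word's integer index in the vocabulary
    return [toy_vocab[word] for word in text.lower().split()]

print(toy_encode("The sky is blue"))  # [0, 1, 2, 3]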

Pretraining Data Format

Each example of the pretraining dataset is a chunk of tokens. The same chunk of tokens is used for the input and the output, but the output is shifted one token into the "future". The reason for this has to do with the parallel processing capabilities of the transformer, which we will cover in depth in the transformer section. The following visual shows what the training data looks like for pretraining.

image.png

Because the model uses transformers and parallel processing, a single example like the one above is, in a sense, 6 different examples. The model is learning the following predictive patterns: given token 1, predict token 2; given tokens 1-2, predict token 3; and so on, up to predicting the final target token from all of the input tokens.

This will be clearer in the transformer section of the article. The main point to know now is the format of the inputs and outputs of the training data in the pretraining step: the outputs are the inputs shifted by one token, so that each input token aligns with the output token that comes directly after it in the original sequence.
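As a quick illustration with made-up token ids, the one-token shift looks like this:

# Illustrative example of the one-token shift (made-up token ids)
chunk = [10, 20, 30, 40, 50, 60, 70]  # one chunk of 7 tokens

inputs  = chunk[:-1]  # everything except the last token
targets = chunk[1:]   # everything except the first token (shifted one step)

for i in range(len(inputs)):
    print(f"given {inputs[:i+1]} -> predict {targets[i]}")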

1.1: Download Pretraining Dataset

Before doing a full pretraining loop, we will do a "test run" using a small dataset we can fit into memory. This will allow us to focus on the internals of the model rather than the complexities of data processing. We will use the Salesforce WikiText dataset, which consists of an extract of good and featured Wikipedia articles [6].

We will load the dataset from the Hugging Face datasets hub. The Hugging Face datasets package provides an easy way to load, preprocess, and use a variety of datasets for deep learning [7].

import warnings
import torch
import math
import time
import os
import matplotlib.pyplot as plt
from itertools import cycle
from datasets import Dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from torch.optim.lr_scheduler import _LRScheduler
warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
cuda
from datasets import load_dataset
dataset = load_dataset("EleutherAI/wikitext_document_level", "wikitext-2-raw-v1", split="train")

1.2 Tokenize & Chunk the Dataset

For pretraining language models, a simple approach to tokenizing and chunking text is as follows:

  1. Concatenate all the text into one giant "blob". This means you have one large string.
  2. Tokenize the whole blob into one list of tokens. At this point you have one large array of integers.
  3. Chunk the tokens into fixed size blocks (1024, 2048, larger...) (this is the "context window"). At this point you have multiple arrays of integers, each of the same length (context size).

This process will change slightly when using datasets that are too large to fit into memory.

1.2.1 Tokenizing: Using Tiktoken

One easy way to tokenize our dataset is to use tiktoken, OpenAI's implementation of BPE (Byte Pair Encoding) tokenization [8]. This article will not go into detail on how the implementation of a tokenizer works; just know that it converts strings of text into lists of integers, and can also convert the lists of integers back into strings of text.

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2") # Get the same tokenizer used for GPT-2


print("Vocabulary size:", tokenizer.n_vocab) # Vocabilary size is how many unique tokens the tokenizer can encode
print("End of text token:", tokenizer.eot_token) # End of text token is used to indicate the end of a text sequence
print("Example tokenization:", tokenizer.encode("Hello world!"))

# Convert entire dataset into a single string
# This dataset is small enough to fit into memory
# For larger datasets, you may need to use more 
# sophisticated methods to process the data.
all_text = ""
all_data = dataset["page"]
for example in all_data:
    all_text += "<page> "+ example + " </page>"

# Tokenize the entire text at once
tokenized_text = tokenizer.encode(all_text)


# We will create a function that generates a dataset of examples
# for the language model. The function will take in the number of
# examples to generate, the block size, and the test split.
# It will return the training and test datasets.
def get_dataset(num_examples, context_window_length, test_split=0.1):
    input_blocks = [] # List to store input sequences
    target_blocks = [] # List to store target sequences

    # Step through the token stream in non-overlapping blocks to create input/target sequences
    for i in range(0, len(tokenized_text), context_window_length + 1):
        block = tokenized_text[i:i+context_window_length+ 1]
        
        # Skip blocks that are too short
        if len(block) < context_window_length + 1:
            continue

        input_seq = block[:-1]  
        target_seq = block[1:]  

        input_blocks.append(input_seq)
        target_blocks.append(target_seq)
        
        # Stop if we have enough examples
        if len(input_blocks) >= num_examples:
            break

    # Convert to tensors for pytorch and move to gpu
    inputs = torch.tensor(input_blocks, dtype=torch.long).to(device)
    targets = torch.tensor(target_blocks, dtype=torch.long).to(device)

    # Calculate train/test split point
    split_idx = int(num_examples * (1 - test_split))

    # Split into train/test
    train_inputs = inputs[:split_idx]
    train_targets = targets[:split_idx]
    test_inputs = inputs[split_idx:]
    test_targets = targets[split_idx:]
    return train_inputs, train_targets, test_inputs, test_targets

# Get a small dataset
i, o, _, _ = get_dataset(2, 4, 0)
print("Input Shape", i.shape)
print("Output Shape", o.shape)
print("Input Example:")
print(i)
print("Output Example:")
print(o)
Vocabulary size: 50257
End of text token: 50256
Example tokenization: [15496, 995, 0]
Input Shape torch.Size([2, 4])
Output Shape torch.Size([2, 4])
Input Example:
tensor([[   27,  7700,    29,   220],
        [  569, 18354,  7496, 17740]], device='cuda:0')
Output Example:
tensor([[ 7700,    29,   220,   796],
        [18354,  7496, 17740,  6711]], device='cuda:0')

Using our tokenizer methods, we have generated a "dummy" dataset that will be used for the rest of the diagrams and examples in this article to show the shapes of the matrices as they flow through the model.

This means that we have a context length of 4 tokens and a batch size of 2. The full dummy dataset has a total of 2 examples. This is far smaller than the dataset would be in reality, but is useful for introducing the architecture.

1.3 Build the LLM

Now that we have a small dummy dataset, we can build our LLM model architecture in PyTorch.

1.3.1 Config Object

First, we can build a "config" object that will store our parameters for the network. We will go through each parameter in depth later on in the network.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A simple configuration container
class GPTConfig:
    def __init__(
        self, 
        vocab_size,  # size of the vocabulary, from tokenizer, for gpt2 tokenizer it is 50257
        n_layer,   # number of transformer blocks
        n_head,    # number of attention heads for each transformer block
        n_embd,  # embedding dimension for each token
        seq_len,  # sequence length for the model - e.g. the "context window" 
    
    ):
        self.vocab_size = vocab_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_embd = n_embd
        self.seq_len = seq_len
     
test_config = GPTConfig(
    vocab_size=tokenizer.n_vocab,
    n_layer=2,  
    n_head=3,
    n_embd=6,
    seq_len=4,
)

1.3.2 Token Embedding Layer

image-5.png

Our first layer of the network is a token embedding layer. This layer is a little different from traditional neural network layers. It is essentially a lookup table that returns an "embedding vector" for a given integer index. The goal of this layer is to convert tokens to vectors. These vectors are tuned as the network is trained so that their position in space relative to the other tokens reflects their statistical relationships with each other.

The embedding layer converts a discrete token (integer) into a semantic representation of that token (vector). Before the embedding layer, the model has no idea what the token means or how it relates to other tokens. After the embedding layer, the model understands the semantic meaning of the token by its relationship with other tokens in the embedding space. For more information on word embeddings, see the Word2Vec paper [13].

These vectors start off random, but during training they slowly assume positions in embedding space that reflect the semantic meaning of their tokens.

image-3.png

For our dummy dataset, the input to this layer will be a matrix of size $2 \times 4$ (batch x token indices). The output will be $2 \times 4 \times 6$ (batch x tokens x embedding dimensions). This transformation can be visualized as follows:

image-4.png

token_embedding = nn.Embedding(test_config.vocab_size, test_config.n_embd).to(device)
test_batch_inputs, _, _, _ = get_dataset(2, test_config.seq_len, 0)
print("Batch shape:", test_batch_inputs.shape, "Batch x Seq Len")
print("After embedding:", token_embedding(test_batch_inputs).shape, "Batch x Seq Len x Embedding Dim")
print("")
print("Before embedding")
print(test_batch_inputs)
print("After embedding")
print(token_embedding(test_batch_inputs))
Batch shape: torch.Size([2, 4]) Batch x Seq Len
After embedding: torch.Size([2, 4, 6]) Batch x Seq Len x Embedding Dim

Before embedding
tensor([[   27,  7700,    29,   220],
        [  569, 18354,  7496, 17740]], device='cuda:0')
After embedding
tensor([[[ 0.7290, -0.2958, -1.0399,  1.4077,  0.7276,  1.1554],
         [-0.5482, -0.3365, -0.1113,  1.3904,  1.6721, -1.5533],
         [ 0.0291, -1.3123,  1.4436,  0.7401,  1.1435,  1.1597],
         [-0.5509,  1.1057,  1.5446,  0.7508,  0.4335,  1.8201]],

        [[ 0.6803, -0.9699, -1.0296,  1.4327,  0.0629,  1.0485],
         [ 0.5021,  0.6006,  0.7069, -0.3284, -0.2663, -1.5875],
         [ 0.6808,  1.7683,  0.8311, -0.3728, -0.1172, -0.0622],
         [ 0.9335, -0.1899, -1.4040,  0.4846,  0.6599,  0.7488]]],
       device='cuda:0', grad_fn=<EmbeddingBackward0>)

In this example, we are using an embedding dimension of 6, so each token is mapped to a vector of length 6. Right now, these vectors don't have any actual meaning; they are randomly initialized. During training, their entries will be slowly nudged via backpropagation, and over time they will start to assume meaning for their respective tokens.

1.3.3 Positional Encoding Layer

image-2.png

After embedding the tokens into embedding vectors, we will add a positional encoding to the vectors. Why do we need a positional encoding? Consider the following sentence:

The planet is smaller than the other planet.

A positional encoding allows the model to differentiate the two instances of the word "planet". Without a positional encoding, the two token embedding vectors for each instance of the word "planet" would be exactly the same. Having a positional encoding allows the model to differentiate the two usages within the same sentence.

We will use the positional encoding formula that was used in the original transformer paper [9]. The formula works by starting out with a matrix of shape sequence length x embedding dimension. The matrix is then filled in with the following formula:

$$PE(pos,2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$$ $$PE(pos,2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$$

Here $pos$ is the position of the token in the sequence, $i$ indexes the embedding dimensions within the token, and $d$ is the embedding dimension size of the model. The formula fills in a matrix of shape (seq_length x embedding size), which depends only on those two sizes: the matrix starts out as all zeros, and then the formula is applied entry by entry.

def get_position_encoding(seq_len, d, n=10000):
    """
    Computes the positional encoding matrix of shape (seq_len, d).
    
    Args:
        seq_len (int): Length of the sequence.
        d (int): Dimension of the embedding.
        n (float): The base for the exponential term (default 10000 in many Transformer implementations).
    
    Returns:
        torch.Tensor: A tensor of shape (seq_len, d) containing the positional encodings.
    """
    
    P = torch.zeros(seq_len, d).to(device)
    for pos in range(seq_len):
        for i in range(0, d // 2):
            P[pos, 2 * i] = math.sin(pos / (n ** ((2 * i) / d)))
            if 2 * i + 1 < d:  # guard the odd column (only relevant for odd d)
                P[pos, 2 * i + 1] = math.cos(pos / (n ** ((2 * i) / d)))

    return P.unsqueeze(0)


# Example usage:
position_encoding = get_position_encoding(seq_len=test_config.seq_len, d=test_config.n_embd)
print("Position encoding shape:", position_encoding.shape)
Position encoding shape: torch.Size([1, 4, 6])

Once we have the positional encoding, we add it to the embedding vectors using element-wise addition. Since we are using PyTorch, the addition will "broadcast" across the first dimension. This means that the 4x6 positional encoding matrix is added to each batch example in parallel.

image.png

test_embeddings = token_embedding(test_batch_inputs)
test_embeddings_with_pos = test_embeddings + position_encoding
print("Token embeddings shape:", test_embeddings.shape)
print("Position encodings shape:", position_encoding.shape)
print("Sum of token embeddings and position encodings:",test_embeddings_with_pos.shape)
Token embeddings shape: torch.Size([2, 4, 6])
Position encodings shape: torch.Size([1, 4, 6])
Sum of token embeddings and position encodings: torch.Size([2, 4, 6])

What is the Intuition Behind Positional Encodings?

At first, it can be challenging to intuit what the positional encoding is doing. The positional encoding is just a constant matrix (given the sequence length and embedding size), with its values set to a desirable pattern. Each row of the matrix aligns to a token position, meaning one constant vector is added to the token at position 1 every time, a different constant vector is added to the token at position 2 every time, and so on.

This differentiates the value of the word "planet" coming at the beginning vs. the end of the sentence. However, sometimes the relative position of words in a sentence is more important than their absolute position. So how do we take that into account? The answer is that the relative relationships between words are emergent. They happen through the process of attention, which we will discuss later.

The key point here is that without positional encoding, these two sentences would look the same to the model:

The dog ran to the owner. / The owner ran to the dog.

The positional encoding makes the vectors for "dog" and "owner" different in the two sentences, which allows attention to catch onto the relative relationships between these two words.

The image below shows an example of a positional encoding matrix. It looks interesting, but what exactly are we looking at? Why does this help the model encode the position of each embedding vector? Remember, each row in our embedding matrix represents a word/token, and we will be adding this positional matrix to the embedding matrix to encode positions. One thing to note about this matrix is that each row is unique. There is also a smooth transition between rows: rows 27 and 28 have very similar patterns, while rows 1 and 120 differ much more. This smoothness is an important feature that helps the model understand position [10].

image.png
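We can reproduce a matrix like this with our get_position_encoding function; the dimensions below are illustrative, not necessarily those used for the image:

# Visualize a larger positional encoding matrix (illustrative dimensions)
pe = get_position_encoding(seq_len=128, d=64).squeeze(0).cpu()
plt.figure(figsize=(8, 4))
plt.imshow(pe, aspect="auto", cmap="viridis")
plt.xlabel("Embedding dimension")
plt.ylabel("Token position")
plt.colorbar()
plt.show()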

There is nothing inherently special about the formula above; there are other ways to encode position. The key thing is that there is some matrix we can add to our embedding matrix that encodes position. This particular formula has properties that make it easy for the model to do that.
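For example, the GPT models learn their position matrix rather than computing it from a fixed formula [1]. A minimal sketch of that alternative, reusing our test config:

# Alternative: a learned positional embedding (the approach used in GPT [1]).
# The position matrix starts random and is tuned by backpropagation,
# just like the token embeddings.
learned_pos = nn.Embedding(test_config.seq_len, test_config.n_embd).to(device)
positions = torch.arange(test_config.seq_len, device=device)
print(learned_pos(positions).shape)  # (seq_len, n_embd), added to the token embeddings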

1.3.4 Masked Multiheaded Self Attention

image-9.png

After positional encoding, we get to the core of the LLM - the (decoder only) transformer. The first step of the transformer is masked multiheaded self attention. We can break down the internals of the transformer into three parts: self attention, then masking, then the multiple heads.

Self Attention

The core idea behind self attention is that it allows every token to "talk" to the other tokens. Attention "reframes" a word's meaning as a combination of all the other words in the context window. A single self attention head does one of many possible "reframings" of each token. It allows the model to understand each word's context in relation to the other words of the sentence.

Self attention starts with just the token embedding matrix with position encodings. It "decomposes" this matrix into queries, keys, and values. In reality all of these are just vectors / matrices that get tuned during training, but we can conceptually think of them as queries, keys, and values due to their dot product operations that take place in the attention operation.

The original equation for scaled dot product attention is as follows [9]: $$Attention(Q,K,V)=softmax(\frac{QK^t}{\sqrt{d_k}})V$$

Q, K, and V are the query, key, and value matrices. They are produced by matrix projections of the input embedding matrix: the token embeddings are multiplied by the $W_q$, $W_k$, and $W_v$ matrices. These weight matrices start off random and are tuned during training. In other words, the network learns via backpropagation what "queries" to ask and what "keys" and "values" to set, by tuning these matrices. It learns how to transform the embedding matrix into "queries", "keys", and "values" in order to best reduce the loss of the network.

The projection operations that generate Q, K, and V are shown below using the dimensions of our dummy dataset/network.

image-3.png

Q, K, and V are all matrices of shape num tokens x embedding size. Each token has a query vector in "query space" and a key vector in "key space". When we perform the $QK^T$ operation, we are calculating how well each token's query matches each key. This can be thought of as a "fuzzy lookup" using vector dot products: if a query and key have a high dot product, the vectors point in similar directions, which means those two tokens are important to take into account together.

After the matrix multiplication between $Q$ and $K^T$, we end up with a similarity matrix of tokens, which tells us how much each token attends to each other token. Each row of the $QK^T$ matrix is put through the softmax function, so each row becomes a probability distribution that sums to one. This distribution can be interpreted as how strongly each key matches the query of that row, i.e., how much each key "attends" to each query.

The value matrix can be thought of as the actual content/information that each token has to offer. This value matrix is weighted by the query/key similarities to produce the final output of self attention.
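To make the full equation concrete, here is a tiny numeric run of scaled dot-product attention on random matrices (shapes and values are illustrative, not from our model):

# Tiny numeric demo of scaled dot-product attention (random illustrative values)
torch.manual_seed(0)
Q = torch.randn(4, 6)  # 4 tokens, 6-dim queries
K = torch.randn(4, 6)  # 4 tokens, 6-dim keys
V = torch.randn(4, 6)  # 4 tokens, 6-dim values

weights = F.softmax(Q @ K.T / math.sqrt(6), dim=-1)  # (4, 4) attention weights
print(weights.sum(dim=-1))  # every row sums to 1
print((weights @ V).shape)  # (4, 6): each output row is a weighted mix of value rows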

Self Attention: Further Intuition

There are some alternative ways to conceive of the individual operations of attention that can help at a conceptual/intuitive level. Let's go through each operation in attention and try to describe in plain English what it is doing.

Q, K, V Matrices Intuition

We know that the $Q$, $K$, $V$ matrices are created by a matrix operation to the input of the transformer (for the first block, this is our position encoded word embeddings). We also know that the weights to create these matrices are tuned through the process of backpropagation. But how can we think of these matrices themselves? What information do they actually contain?

$Q$ Matrix Intuition

The $Q$ matrix can be thought of as n rows of queries or questions, where n is the number of tokens in the input. When thinking about the $Q$ matrix, think of it as n vectors instead of a single matrix, where each vector is a query or question about the corresponding word that could be answered by some combination of the other words. Remember, we are "reframing" the given word as some combination of the other words. For example, it could look like the following:

image.png

In this case each token has a corresponding question. These questions or queries are going to be questions that can be answered by the surrounding tokens. So how are these questions created? $W_q$ is responsible for creating the right questions for each token (with position). $W_q$ maps a token to a relevant query about that token. These queries become relevant through the process of training via backpropagation.

$K$ Matrix Intuition

We can think of the $K$ matrix as n row vectors of keys, where n is the number of tokens in the input. What do we mean by "keys"? It is easiest to think of keys as facts that can help answer queries. Above, in the query section, we asked questions like "what noun do I describe?". A key that might closely match this query would be "I am a noun that can be described". Similar to the queries, $W_k$ creates these keys by learning the right mapping from token to corresponding key. These keys become good matches for the queries because of the $QK^T$ operation that is performed during training.

image-5.png

Overall, each key can be conceived of as a fact about that token that could help answer the queries that the other tokens might have.

$QK^T$ Operation Intuition:

Now that we have an intuition for the $Q$ and $K$ matrices, we can think about what the matrix multiplication $QK^T$ in the attention equation is doing. The $QK^T$ operation is a matching operation, where each query is compared with each key via a dot product. If the dot product is large, the key answers or "attends" to the query. If the dot product is small, the key is unrelated and does not help answer the query. The $QK^T$ operation "reframes" each query as a set of keys. The resulting matrix can be thought of as n row vectors, where every dimension of each row vector is a weight for a token's key/fact. So a vector in this space is some weighted combination of all of the tokens (keys).

Basically, what we are doing is redescribing each original token query/question as a weighted vector of all of the token keys/answers. Instead of a question about a token, we have n different answers, each with its own weight.

When doing the $QK^T$ operation, we are reframing the query row vectors as combinations of the keys. Remember, each query has to do with how its token relates to the other tokens, so the answers can be formed as some combination of the other tokens.

image-4.png

$\frac{QK^T}{\sqrt{d_k}}$ Operation Intuition:

This operation makes the output of the softmax more stable. The dot product of two random vectors of dimension $d_k$ has variance that grows proportionally to $d_k$, so its typical magnitude grows with $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ ensures that, no matter how large $d_k$ is, the softmax works as expected and does not produce extreme values.

This is an element-wise division, so every element of the matrix is divided by this value. The resulting matrix can be thought of in the same way as the $QK^T$ result, just scaled.
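We can check this scaling effect empirically (a quick illustrative experiment, not from the original article):

# Empirical check: unscaled dot products grow with sqrt(d_k); scaled ones stay ~1
torch.manual_seed(0)
for d_k in [16, 256, 4096]:
    q = torch.randn(1000, d_k)
    k = torch.randn(1000, d_k)
    dots = (q * k).sum(dim=-1)  # 1000 dot products of random d_k-dim vectors
    print(f"d_k={d_k}: std of q.k = {dots.std():.1f}, "
          f"after /sqrt(d_k) = {(dots / math.sqrt(d_k)).std():.2f}")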

$softmax(\frac{QK^T}{\sqrt{d_k}})$ Operation Intuition:

The softmax operation is performed row-wise on the $\frac{QK^T}{\sqrt{d_k}}$ matrix, so every row becomes a probability distribution. We can still think of each row as a token's "reframed" query vector, but now each row vector sums to one.

$V$ Matrix Intuition

The $V$ matrix is a bit harder to conceive of, but it can be thought of as a set of feature columns: each column is a learned feature, and each element is the value of that feature for the token in that row. The rows are "feature" vectors that contain information about specific learned features for each token. In the final operation, these feature vectors are weighted, meaning the features of certain tokens are focused on more than those of other tokens. The $V$ matrix is the actual content or output of attention; this content is weighted by the $softmax(\frac{QK^T}{\sqrt{d_k}})$ operation.

image-8.png

$softmax(\frac{QK^T}{\sqrt{d_k}})V$ Operation Intuition:

Now for the final operation of attention: multiplying by the $V$ matrix. We can think of the $V$ matrix as containing the original content of the embeddings. We weight this content based on the query/key matches. In other words, we weight the content based on the specific questions we are trying to ask and how the other words in context answer those questions.

$$softmax(\frac{QK^T}{\sqrt{d_k}})V$$

image-7.png

Putting this all together (using the dimensions of our "test" config object, as in the code), we can see all the matrix operations and their dimensions through the self attention operation.

image-6.png

Self Attention: Code

Self attention can be written as a self contained pytorch module as shown below.

class SelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Create the weights directly on the device so they stay registered as
        # parameters (calling .to(device) on an nn.Parameter returns a plain
        # tensor that the optimizer would not see)
        self.Wq = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Query weights - transform input embeddings into queries
        self.Wk = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Key weights - transform input embeddings into keys
        self.Wv = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Value weights - transform input embeddings into values

    def forward(self, x):
        print("Attention input shape:", x.shape)
        print("")
        print("Query weights shape:", self.Wq.shape)
        print("Key weights shape:", self.Wk.shape)
        print("Value weights shape:", self.Wv.shape)
        queries = x @ self.Wq # Matrix multiplication to transform input embeddings into queries
        keys = x @ self.Wk # Matrix multiplication to transform input embeddings into keys
        values = x @ self.Wv # Matrix multiplication to transform input embeddings into values
        print("")
        print("Queries shape:", queries.shape)
        print("Keys shape:", keys.shape)
        print("Values shape:", values.shape)

        qkt = queries @ keys.transpose(-2, -1) # Calculate QK^T
        qkt_scaled = qkt / math.sqrt(queries.size(-1)) # Scale QK^T by the dimension of the keys
        qkt_softmax = F.softmax(qkt_scaled, dim=-1) # Apply softmax row-wise to get attention weights
        print("")
        print("QK^T shape:", qkt.shape)

        attn_output = qkt_softmax @ values # Multiply softmax(QK^T) by values
        print("")
        print("Attention output shape:", attn_output.shape)
        return attn_output 

attention = SelfAttention(test_config)
test_out = attention(test_embeddings_with_pos)
Attention input shape: torch.Size([2, 4, 6])

Query weights shape: torch.Size([6, 6])
Key weights shape: torch.Size([6, 6])
Value weights shape: torch.Size([6, 6])

Queries shape: torch.Size([2, 4, 6])
Keys shape: torch.Size([2, 4, 6])
Values shape: torch.Size([2, 4, 6])

QK^T shape: torch.Size([2, 4, 4])

Attention output shape: torch.Size([2, 4, 6])

Causal Self Attention

Now that we have implemented self attention, we can move on to causal self attention. During training, the transformer tries to predict the next token at every position in parallel. However, we would be cheating if we allowed attention to see future tokens during training: the model would simply predict the future tokens by looking at them. For this reason, we need to mask the matrices so that future tokens are hidden from the self attention layers. We perform this masking after the $QK^T$ operation [11].

image.png

The masking process makes the output of the softmax operation 0 in the upper-right triangle. As a result, each token's query can only be "reframed" using the keys of that token and the tokens before it; future tokens contribute nothing. When we say a query is able to be reframed by a key, what we mean mathematically is that the value in that matrix entry can be greater than 0.
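Before modifying the attention class, a quick standalone demo (with random stand-in scores) shows what the mask does to the softmax output:

# Demo: applying a causal mask to a small scores matrix (illustrative values)
scores = torch.randn(4, 4)  # stand-in for QK^T / sqrt(d_k) over 4 tokens
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
masked = scores.masked_fill(mask, float("-inf"))
print(F.softmax(masked, dim=-1))  # upper triangle is exactly 0; rows still sum to 1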

We can modify our self attention block above to add masking with the following changes:

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        # As before, create the weights directly on the device so they stay
        # registered as parameters
        self.Wq = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Query weights - transform input embeddings into queries
        self.Wk = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Key weights - transform input embeddings into keys
        self.Wv = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Value weights - transform input embeddings into values

    def forward(self, x):
        seq_len = x.shape[1] # Get sequence length (number of tokens / context window length)
        queries = x @ self.Wq # Matrix multiplication to transform input embeddings into queries
        keys = x @ self.Wk    # Matrix multiplication to transform input embeddings into keys
        values = x @ self.Wv  # Matrix multiplication to transform input embeddings into values
        qkt = queries @ keys.transpose(-2, -1)  # Calculate QK^T
        qkt_scaled = qkt / math.sqrt(queries.size(-1))  # Scale QK^T by the dimension of the keys

        # MASKING
        # THIS IS THE ONLY DIFFERENCE, USE -inf FOR UPPER TRIANGLE MASK SO THAT SOFTMAX WILL BE 0
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1)
        causal_mask = causal_mask.masked_fill(causal_mask == 1, float('-inf'))  # Upper triangle masked with -inf 
        qkt_scaled = qkt_scaled + causal_mask # Add the mask to the scaled QK^T
        # END MASKING

        qkt_softmax = F.softmax(qkt_scaled, dim=-1) # Apply softmax row-wise to get attention weights, the -inf values will become 0 here
        attn_output = qkt_softmax @ values # Multiply softmax(QK^T) by values
        return attn_output


attention = CausalSelfAttention(test_config)
test_out = attention(test_embeddings_with_pos)
print(test_out.shape)  # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])

Multi-Headed Causal Self Attention

Now that we have causal self attention, we can add the "multi-headed" part of the attention layer. Multi-headed attention splits the attention operation across several parallel "heads", each with its own learned Q, K, and V weights.

Multi-Headed Causal Self Attention intuition

What is this actually doing conceptually? It is allowing each head to have the tokens attend to each other in different ways. For instance, one head might focus on grammatical structure, another on semantic meaning, and another on real-world meaning. If viewing the sentence "the sky is blue" from a grammatical structure perspective, the word "the" might attend heavily to the word "sky" because that is what it refers to. However, viewing attention through the lens of real-world meaning, the word "the" won't attend to the word "sky" very much because their meanings are not similar. Each word's relationship to the other words might be different depending on what "lens" (or "head") you are viewing them through.

To reiterate, this is a helpful conceptual way to think about multi-headed attention, but the meaning of each head is not always human-interpretable in this way. The heads take on whatever meaning helps minimize the loss function on the training set.

The final output of Multi-Headed Causal Self Attention is the same size as the input to the self attention layer.

Multi-Headed Causal Self Attention Steps

Below is an outline of all the steps in multi-headed causal self attention. The steps map directly to the PyTorch code in the subsequent segment, and are meant to help visualize what is happening in the full attention operation.

Step 1: Multiply Input by Wqkv

In the sections above, we referred to Wq, Wk, and Wv as separate matrices. While that is true and helpful conceptually, in practice we concatenate them into one matrix, Wqkv, to make the multi-headed self attention operation more efficient.

The first step is to multiply x by this weight matrix. This is done through a standard PyTorch linear layer. The resulting matrix will be our query, key, and value matrices concatenated.

image.png

Step 2: Split the Q, K, V Matrices

Using the split operation in PyTorch, we can split out the Q, K, and V matrices back to individual matrices.

image-2.png

Step 3: Reshape the Q, K, V Matrices Into Heads

Now that we have the Q, K, and V matrices, we can reshape them into heads. This operation should illustrate why, in multi-headed self attention, the embedding dimension must be divisible by the number of heads. The image below shows reshaping the Q matrix, but the K and V matrices are reshaped in the same way.

image-3.png

Step 4: QK^T

Now we can perform the QK^T operation to get the query/key matches. This operation is the same as shown in self attention above, but now we have multiple heads. In our example we have 3 heads. All this means is that we are doing batch matrix multiplication, with the QK^T operation happening for each head in parallel. This means we have different query/key matches for each head.

image-4.png

Step 5: Mask Before Softmax

We take the result and apply the causal mask before the softmax operation, just like above. The main difference here is that the mask is applied to all 3 heads in parallel.

image-5.png

Step 6: Softmax & Multiply by V

We then apply the softmax and multiply by V to get the attended values.

image-6.png

Step 7: Merge Heads

We now have "V attended" which has 3 heads. We can merge these back together into a single matrix before sending them through a feedforward layer.

image-8.png

Step 8: Projection Layer

Finally, we feed the attended values through a linear layer to get the final attention output. This final layer allows information to be combined and mixed between the heads, and projects the shape to match the input shape.

The final attention output can be thought of as the input tokens, but now cross pollinated with information from their interactions with each other.

image-7.png

Multi-Headed Causal Self Attention Code

The following code snippet shows an implementation of multi-headed causal self attention, building on our previous attention blocks. All heads are processed in parallel using batched matrix multiplications, so no per-head loop is needed.

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0, "n_embd must be divisible by n_head"

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head

        self.Wqkv = nn.Linear(self.n_embd, 3 * self.n_embd, bias=False).to(device)
        self.proj = nn.Linear(self.n_embd, self.n_embd, bias=False).to(device)

        # Causal mask to ensure that attention is only applied to previous tokens in the sequence
        mask = torch.tril(
            torch.ones(config.seq_len, config.seq_len, device=device, dtype=torch.bool)
        )
        self.register_buffer("causal_mask", mask.view(1, 1, config.seq_len, config.seq_len))

    def forward(self, x):
        B, seq_len, n_embd = x.shape  # (batch, time, channels)

        # 1) Multiply input by Wqkv to get queries, keys, values
        qkv = self.Wqkv(x)  # (B, seq_len, 3n_embd)

        # 2) Split the Q, K, V matrices
        q, k, v = qkv.split(n_embd, dim=2)  # each (B, seq_len, n_embd)
     
        # 3) Reshape the Q, K, V Matrices Into Heads
        #    (B,T,C) -> (B, seq_len, n_head, head_dim) -> (B, n_head, seq_len, head_dim)
        q = q.view(B, seq_len, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, seq_len, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, seq_len, self.n_head, self.head_dim).transpose(1, 2)
      

        # 4) QK^T
        #    (B, n_head, seq_len, seq_len)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # 5) Mask Before Softmax
        mask = self.causal_mask[:, :, :seq_len, :seq_len]
        att = att.masked_fill(~mask, float("-inf"))

        # 6) Softmax & Multiply by V
        #    (B, n_head, seq_len, head_dim)
        att = F.softmax(att, dim=-1)
        y = att @ v        

        # 7) Merge heads: 
        #    (B, n_head, seq_len, head_dim) -> (B, seq_len, embedding_dim)
        y = y.transpose(1, 2).contiguous().view(B, seq_len, n_embd)

        # 8) Projection Layer 
        y = self.proj(y)
        return y

multihead_attn = MultiHeadAttention(test_config)
test_out = multihead_attn(test_embeddings_with_pos)
print(test_out.shape)  # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])

1.3.5 The Block

We have now successfully implemented multi-headed attention. There are just a few steps left until we have a GPT "block" that we can stack onto the network over and over again. The architecture of a GPT block is as follows:

image-2.png

So far we have built the token embedding, positional encoding, and masked multi-headed self attention parts. Now we need to add the normalization layers and the feedforward layer. These are straightforward PyTorch layers that are common across many neural network architectures.

Layer normalization layers

Layer normalization is straightforward and used in many deep learning architectures. It normalizes the values of the incoming matrix across the feature dimension (in our case, dimension 2). It is used to stabilize training and achieve faster convergence.
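A quick demo (with arbitrary values) shows layer normalization acting on the feature dimension:

# Demo: LayerNorm normalizes each token's feature vector independently
ln = nn.LayerNorm(test_config.n_embd).to(device)
x = torch.randn(2, 4, test_config.n_embd, device=device) * 5 + 3  # arbitrary scale and shift
out = ln(x)
print(out.mean(dim=-1))                  # ~0 for every token position
print(out.var(dim=-1, unbiased=False))   # ~1 for every token position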

Feedforward layer

The feedforward layer of the transformer block operates with a different paradigm than attention. While attention captures relationships between tokens, the feedforward layer applies the same transformation to each token in parallel. It can be implemented using standard PyTorch linear layers. We use a hidden size of 4 x the embedding dimension, as was done in the original Attention Is All You Need paper [9], and the Gaussian Error Linear Unit (GELU) activation function, as in the original GPT paper [1].

class GPTBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.mha = MultiHeadAttention(config)
        self.ln1 = nn.LayerNorm(config.n_embd).to(device)
        self.ffn = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        ).to(device)
        self.ln2 = nn.LayerNorm(config.n_embd).to(device)

    def forward(self, x):
        x = x + self.mha(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

block = GPTBlock(test_config)
test_out = block(test_embeddings_with_pos)
print(test_out.shape)  # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])

1.3.6 Putting it All Together

Now that we have a block, we can stack multiple blocks together to get a GPT-style LLM.

class GPTModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd).to(device)
        self.position_encoding = get_position_encoding(config.seq_len, config.n_embd)
        self.blocks = nn.Sequential(*[GPTBlock(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd).to(device)
        self.head = nn.Linear(config.n_embd, config.vocab_size).to(device)
    
    def forward(self, x):
        x = self.token_embedding(x) + self.position_encoding
        x = self.blocks(x)
        x = self.ln_f(x)
        return self.head(x)
    
gpt = GPTModel(test_config)
print(test_batch_inputs.shape)
test_out = gpt(test_batch_inputs)
print(test_out.shape)  
torch.Size([2, 4])
torch.Size([2, 4, 50257])

That is a full forward pass through the LLM: the input is of shape $[batch, tokens]$ and the output is of shape $[batch, tokens, vocab]$. For each token in the input, the LLM predicts a discrete probability distribution (as logits) over the next token.

The transformer makes these predictions in parallel, one for each token in the input. While all of them are used in training, only the last prediction (for token $n$) is used at inference time to make the final prediction.

The following diagram shows the full forward pass with shapes as one example moves through the matrix.

image-2.png
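As a preview of inference, here is a minimal greedy-decoding sketch: take the distribution for the last position, pick the most likely token, append it, and repeat. The helper below is illustrative (the untrained model will of course produce gibberish), and it left-pads with the end-of-text token because our model expects a fixed-size input:

# A minimal greedy-decoding sketch (illustrative helper)
def greedy_generate(model, prompt_tokens, num_new_tokens, seq_len):
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        context = tokens[-seq_len:]
        # Left-pad so the input is always exactly seq_len tokens long
        context = [tokenizer.eot_token] * (seq_len - len(context)) + context
        x = torch.tensor([context], dtype=torch.long, device=device)
        with torch.no_grad():
            logits = model(x)  # (1, seq_len, vocab_size)
        tokens.append(logits[0, -1].argmax().item())  # prediction for the last position
    return tokens

generated = greedy_generate(gpt, tokenizer.encode("Hello world"), 3, test_config.seq_len)
print(tokenizer.decode(generated))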

1.3.7 Dummy Training Loop

Now that we have gone through the forward pass of the model, we can train it. The model is trained using next-token prediction.

Objective Function

According to the original GPT paper, the objective function of pretraining is the following [1]:

$$L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \theta)$$

Maximizing this objective function is essentially the same as minimizing the cross entropy loss function.

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

This is because during training we use a one-hot encoded vector for the true distribution, so $p(x)$ is 1 for the correct token and 0 for all other tokens. This means we can drop the sum and simplify the cross entropy loss to:

$$H(p, q) = -\log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \theta)$$

PyTorch has a pre-built cross entropy loss function that we can use as our criterion to minimize [12].
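As a quick sanity check (with illustrative values), PyTorch's cross entropy matches the negative log of the softmax probability assigned to the correct token:

# Demo: cross entropy = -log(softmax probability of the correct token)
torch.manual_seed(0)
logits = torch.randn(1, 10)  # one prediction over a 10-token vocabulary
target = torch.tensor([3])   # the "correct" next token

ce = F.cross_entropy(logits, target)
manual = -torch.log(F.softmax(logits, dim=-1)[0, 3])
print(ce.item(), manual.item())  # identical values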

Test One: Overfitting

We will first train the model on a tiny dataset (10 examples) and see if we can get it to memorize/overfit the data. This is a good sanity check that the architecture is correct and that the loss decreases as expected.

# Example config:
batch_size = 10
sequence_len = 128
num_steps = 1000
train_inputs, train_targets, _, _ = get_dataset(10, sequence_len, 0)
config = GPTConfig(
    vocab_size=tokenizer.n_vocab,
    n_layer=4,   # fewer layers for a quick demo
    n_head=4,
    n_embd=128,
    seq_len=sequence_len,
)


# Create the GPT model
model = GPTModel(config)

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Define Scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',factor=0.2, patience=20, min_lr=5e-6, threshold=1e-4)

# Training loop
i = 1
losses = []

while i < num_steps:
    for j in range(0, len(train_inputs), batch_size):
        x = train_inputs[j:j+batch_size]
        y = train_targets[j:j+batch_size]

        # Forward pass
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        losses.append(loss.item())
        
        optimizer.step()
        optimizer.zero_grad()
    

        loss = loss.item()
        scheduler.step(loss)

   
        # Print the loss for this step
        lr = optimizer.param_groups[0]["lr"]
        print(f"Step {i+1}/{num_steps}, Loss: {loss}, LR: {lr}")

        i += 1
Step 2/1000, Loss: 11.028936386108398, LR: 0.0005
Step 3/1000, Loss: 10.818150520324707, LR: 0.0005
Step 4/1000, Loss: 10.620841026306152, LR: 0.0005
Step 5/1000, Loss: 10.440305709838867, LR: 0.0005
Step 6/1000, Loss: 10.277081489562988, LR: 0.0005
Step 7/1000, Loss: 10.129175186157227, LR: 0.0005
Step 8/1000, Loss: 9.992925643920898, LR: 0.0005
Step 9/1000, Loss: 9.864274978637695, LR: 0.0005
Step 10/1000, Loss: 9.739694595336914, LR: 0.0005
Step 11/1000, Loss: 9.616449356079102, LR: 0.0005
Step 12/1000, Loss: 9.492518424987793, LR: 0.0005
Step 13/1000, Loss: 9.366469383239746, LR: 0.0005
Step 14/1000, Loss: 9.23737621307373, LR: 0.0005
Step 15/1000, Loss: 9.104846000671387, LR: 0.0005
Step 16/1000, Loss: 8.969032287597656, LR: 0.0005
Step 17/1000, Loss: 8.830738067626953, LR: 0.0005
Step 18/1000, Loss: 8.691510200500488, LR: 0.0005
Step 19/1000, Loss: 8.55342960357666, LR: 0.0005
Step 20/1000, Loss: 8.41820240020752, LR: 0.0005
Step 21/1000, Loss: 8.285758972167969, LR: 0.0005
Step 22/1000, Loss: 8.154431343078613, LR: 0.0005
Step 23/1000, Loss: 8.022896766662598, LR: 0.0005
Step 24/1000, Loss: 7.890986442565918, LR: 0.0005
Step 25/1000, Loss: 7.758973121643066, LR: 0.0005
Step 26/1000, Loss: 7.627130031585693, LR: 0.0005
Step 27/1000, Loss: 7.4958906173706055, LR: 0.0005
Step 28/1000, Loss: 7.365780830383301, LR: 0.0005
Step 29/1000, Loss: 7.237158298492432, LR: 0.0005
Step 30/1000, Loss: 7.110237121582031, LR: 0.0005
Step 31/1000, Loss: 6.985257148742676, LR: 0.0005
Step 32/1000, Loss: 6.862329006195068, LR: 0.0005
Step 33/1000, Loss: 6.741218566894531, LR: 0.0005
Step 34/1000, Loss: 6.621609687805176, LR: 0.0005
Step 35/1000, Loss: 6.503492832183838, LR: 0.0005
Step 36/1000, Loss: 6.387082576751709, LR: 0.0005
Step 37/1000, Loss: 6.272494792938232, LR: 0.0005
Step 38/1000, Loss: 6.1597442626953125, LR: 0.0005
Step 39/1000, Loss: 6.0489373207092285, LR: 0.0005
Step 40/1000, Loss: 5.94016170501709, LR: 0.0005
Step 41/1000, Loss: 5.8332905769348145, LR: 0.0005
Step 42/1000, Loss: 5.728141784667969, LR: 0.0005
Step 43/1000, Loss: 5.624641418457031, LR: 0.0005
Step 44/1000, Loss: 5.522706031799316, LR: 0.0005
Step 45/1000, Loss: 5.422231197357178, LR: 0.0005
Step 46/1000, Loss: 5.323247909545898, LR: 0.0005
Step 47/1000, Loss: 5.225776672363281, LR: 0.0005
Step 48/1000, Loss: 5.129720687866211, LR: 0.0005
Step 49/1000, Loss: 5.034989833831787, LR: 0.0005
Step 50/1000, Loss: 4.941556453704834, LR: 0.0005
Step 51/1000, Loss: 4.849225997924805, LR: 0.0005
Step 52/1000, Loss: 4.757768630981445, LR: 0.0005
Step 53/1000, Loss: 4.667154788970947, LR: 0.0005
Step 54/1000, Loss: 4.57724666595459, LR: 0.0005
Step 55/1000, Loss: 4.4878716468811035, LR: 0.0005
Step 56/1000, Loss: 4.399011135101318, LR: 0.0005
Step 57/1000, Loss: 4.310593605041504, LR: 0.0005
Step 58/1000, Loss: 4.222630977630615, LR: 0.0005
Step 59/1000, Loss: 4.135863304138184, LR: 0.0005
Step 60/1000, Loss: 4.05898904800415, LR: 0.0005
Step 61/1000, Loss: 3.992307662963867, LR: 0.0005
Step 62/1000, Loss: 3.886232852935791, LR: 0.0005
Step 63/1000, Loss: 3.813814640045166, LR: 0.0005
Step 64/1000, Loss: 3.74583101272583, LR: 0.0005
Step 65/1000, Loss: 3.652099609375, LR: 0.0005
Step 66/1000, Loss: 3.586872100830078, LR: 0.0005
Step 67/1000, Loss: 3.505384922027588, LR: 0.0005
Step 68/1000, Loss: 3.4279913902282715, LR: 0.0005
Step 69/1000, Loss: 3.3585686683654785, LR: 0.0005
Step 70/1000, Loss: 3.275824785232544, LR: 0.0005
Step 71/1000, Loss: 3.2014312744140625, LR: 0.0005
Step 72/1000, Loss: 3.131232976913452, LR: 0.0005
Step 73/1000, Loss: 3.049182891845703, LR: 0.0005
Step 74/1000, Loss: 2.9816317558288574, LR: 0.0005
Step 75/1000, Loss: 2.9068069458007812, LR: 0.0005
Step 76/1000, Loss: 2.8312482833862305, LR: 0.0005
Step 77/1000, Loss: 2.7614734172821045, LR: 0.0005
Step 78/1000, Loss: 2.6829771995544434, LR: 0.0005
Step 79/1000, Loss: 2.6195552349090576, LR: 0.0005
Step 80/1000, Loss: 2.5445303916931152, LR: 0.0005
Step 81/1000, Loss: 2.4811675548553467, LR: 0.0005
Step 82/1000, Loss: 2.4033358097076416, LR: 0.0005
Step 83/1000, Loss: 2.3365752696990967, LR: 0.0005
Step 84/1000, Loss: 2.286520481109619, LR: 0.0005
Step 85/1000, Loss: 2.2043795585632324, LR: 0.0005
Step 86/1000, Loss: 2.1443023681640625, LR: 0.0005
Step 87/1000, Loss: 2.0734801292419434, LR: 0.0005
Step 88/1000, Loss: 2.041525363922119, LR: 0.0005
Step 89/1000, Loss: 1.9522135257720947, LR: 0.0005
Step 90/1000, Loss: 1.938239336013794, LR: 0.0005
Step 91/1000, Loss: 1.8657023906707764, LR: 0.0005
Step 92/1000, Loss: 1.8048756122589111, LR: 0.0005
Step 93/1000, Loss: 1.7578691244125366, LR: 0.0005
Step 94/1000, Loss: 1.701775312423706, LR: 0.0005
Step 95/1000, Loss: 1.6422252655029297, LR: 0.0005
Step 96/1000, Loss: 1.6149051189422607, LR: 0.0005
Step 97/1000, Loss: 1.5551875829696655, LR: 0.0005
Step 98/1000, Loss: 1.5064985752105713, LR: 0.0005
Step 99/1000, Loss: 1.4643620252609253, LR: 0.0005
Step 100/1000, Loss: 1.4139996767044067, LR: 0.0005
Step 101/1000, Loss: 1.385868787765503, LR: 0.0005
Step 102/1000, Loss: 1.3312125205993652, LR: 0.0005
Step 103/1000, Loss: 1.2840286493301392, LR: 0.0005
Step 104/1000, Loss: 1.2414436340332031, LR: 0.0005
Step 105/1000, Loss: 1.1958096027374268, LR: 0.0005
Step 106/1000, Loss: 1.1562128067016602, LR: 0.0005
Step 107/1000, Loss: 1.1319259405136108, LR: 0.0005
Step 108/1000, Loss: 1.0711021423339844, LR: 0.0005
Step 109/1000, Loss: 1.047868013381958, LR: 0.0005
Step 110/1000, Loss: 1.0212820768356323, LR: 0.0005
Step 111/1000, Loss: 0.9612825512886047, LR: 0.0005
Step 112/1000, Loss: 0.973650336265564, LR: 0.0005
Step 113/1000, Loss: 0.9346786737442017, LR: 0.0005
Step 114/1000, Loss: 0.9234967231750488, LR: 0.0005
Step 115/1000, Loss: 0.8635486364364624, LR: 0.0005
Step 116/1000, Loss: 0.838901698589325, LR: 0.0005
Step 117/1000, Loss: 0.8201435804367065, LR: 0.0005
Step 118/1000, Loss: 0.7784948348999023, LR: 0.0005
Step 119/1000, Loss: 0.7447291612625122, LR: 0.0005
Step 120/1000, Loss: 0.7342116236686707, LR: 0.0005
Step 121/1000, Loss: 0.7056399583816528, LR: 0.0005
Step 122/1000, Loss: 0.6909168362617493, LR: 0.0005
Step 123/1000, Loss: 0.6604924201965332, LR: 0.0005
Step 124/1000, Loss: 0.6375840306282043, LR: 0.0005
Step 125/1000, Loss: 0.6168572902679443, LR: 0.0005
Step 126/1000, Loss: 0.5916572213172913, LR: 0.0005
Step 127/1000, Loss: 0.5694154500961304, LR: 0.0005
Step 128/1000, Loss: 0.5482650995254517, LR: 0.0005
Step 129/1000, Loss: 0.5271603465080261, LR: 0.0005
Step 130/1000, Loss: 0.5043598413467407, LR: 0.0005
Step 131/1000, Loss: 0.484113872051239, LR: 0.0005
Step 132/1000, Loss: 0.4653652310371399, LR: 0.0005
Step 133/1000, Loss: 0.44524794816970825, LR: 0.0005
Step 134/1000, Loss: 0.425358384847641, LR: 0.0005
Step 135/1000, Loss: 0.40840944647789, LR: 0.0005
Step 136/1000, Loss: 0.3909088373184204, LR: 0.0005
Step 137/1000, Loss: 0.37400251626968384, LR: 0.0005
Step 138/1000, Loss: 0.3574616611003876, LR: 0.0005
Step 139/1000, Loss: 0.3430023789405823, LR: 0.0005
Step 140/1000, Loss: 0.32840004563331604, LR: 0.0005
Step 141/1000, Loss: 0.31408971548080444, LR: 0.0005
Step 142/1000, Loss: 0.30126383900642395, LR: 0.0005
...
Step 999/1000, Loss: 0.003009930718690157, LR: 0.0005
Step 1000/1000, Loss: 0.003004607977345586, LR: 0.0005
plt.plot(losses)
[<matplotlib.lines.Line2D at 0x3315e5c50>]
(Plot: the training loss curve, falling smoothly from ~0.33 toward ~0.003 over the 1000 steps.)

1.3.8 Test Two: Memorization

To perform inference, we can autoregressively feed data into the transformer, sliding each selected output token back into the input. We can test this on one of our training examples and check that the model reproduces it accurately. Since the model has been deliberately overfit to this tiny dataset, it should continue a training prompt token for token.

def inference(prompt, max_new_tokens):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        num_tokens = len(tokens)
        # Pad the sequence out to the model's full context window with EOT tokens
        tokens_padded = tokens + [tokenizer.eot_token] * (config.seq_len - num_tokens)
        tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
        with torch.no_grad():
            logits = model(tokens_padded)
        # The logits at position num_tokens - 1 are the model's prediction for the
        # next token; greedy decoding takes the argmax and appends it to the input
        predicted_token = torch.argmax(logits[0, num_tokens - 1, :]).item()
        tokens.append(predicted_token)
    return tokenizer.decode(tokens)
    
print("Original: ", tokenizer.decode(train_inputs[2].tolist())[:90])
print("Predicted:", inference(" director Takeshi Ozawa . A large team of writers handled the script", max_new_tokens=6))
Original:   director Takeshi Ozawa . A large team of writers handled the script . The game 's opening
Predicted:  director Takeshi Ozawa . A large team of writers handled the script . The game 's opening
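
Greedy argmax decoding is deterministic, which is exactly what we want for this memorization test. For more varied generations later on, a common alternative is to sample from the output distribution instead. Below is a minimal sketch of temperature plus top-k sampling; the sample_next_token helper and its parameter values are illustrative assumptions, not part of the original notebook.

def sample_next_token(logits, temperature=0.8, top_k=50):
    # Soften (or sharpen) the distribution with temperature, keep only the
    # top-k candidate tokens, and sample from the renormalized probabilities
    scaled = logits / temperature
    topk_vals, topk_idx = torch.topk(scaled, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

Swapping the argmax line in inference() for sample_next_token(logits[0, num_tokens - 1, :]) would turn the deterministic reproduction above into stochastic generation.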

1.4 Real Training Loop

Using tiktoken and a small dataset, we were able to overfit the model and run a few inference examples. However, to train an LLM that can do useful things, we will need a much larger dataset, one that won't fit in memory. We will also need an efficient way to tokenize the dataset and load it into PyTorch tensors.

1.4.1 Huggingface Streaming Dataset

Hugging Face's datasets library makes this process straightforward: we can iterate over the raw text, tokenize and chunk it on the fly, and save the result to disk as Parquet files.

# Load the raw CNN/DailyMail dataset and the GPT-2 tokenizer
ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def check_dataset_exists():
    try:
        # Try to load the tokenized parquet files; load_dataset raises
        # FileNotFoundError if they have not been written to disk yet
        load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
        load_dataset("parquet", data_files="cnn_dailymail_test.parquet", split="train")
        return True
    except FileNotFoundError:
        return False
    
if not check_dataset_exists():
    print("Tokenized dataset does not exist locally... Generating and saving to disk.")

    def tokenize_and_chunk(dataset, tokenizer, chunk_size=512, train_rows=100_000, test_rows=500):
        """
        Tokenizes the dataset and chunks it into fixed-length `chunk_size`-token segments.
        The 'target' sequence is the input shifted left by 1 token.
        Stops after generating `train_rows + test_rows` tokenized chunks.
        """
        buffer = []  # Rolling buffer for tokens
        row_count = 0

        for example in dataset:
            tokens = tokenizer(example["article"], truncation=False, padding=False)['input_ids']
            buffer.extend(tokens)

            # Yield full chunks until we reach train_rows + test_rows
            while len(buffer) >= chunk_size + 1:  # +1 to ensure we can shift target
                if row_count >= (train_rows + test_rows):
                    return  # Stop yielding once enough rows are reached

                # Create input-target pairs
                input_chunk = buffer[:chunk_size]          # First chunk_size tokens
                target_chunk = buffer[1:chunk_size + 1]    # The same tokens, shifted by 1
                
                # Assign to train or test split
                split = "train" if row_count < train_rows else "test"

                yield {
                    "split": split,
                    "input": input_chunk, 
                    "target": target_chunk
                }
                
                buffer = buffer[chunk_size:]  # Remove used tokens
                row_count += 1

    # Set the max number of rows for training and testing
    TRAIN_ROWS = 1400000  # Adjust as needed
    TEST_ROWS = 500   # Adjust as needed
    CHUNK_SIZE = 128

    # Convert generator to a Hugging Face Dataset
    tokenized_ds = Dataset.from_generator(lambda: tokenize_and_chunk(ds, hf_tokenizer, chunk_size=CHUNK_SIZE, train_rows=TRAIN_ROWS, test_rows=TEST_ROWS))

    # Split the dataset into `train` and `test`
    dataset_splits = tokenized_ds.train_test_split(test_size=TEST_ROWS / (TRAIN_ROWS + TEST_ROWS), seed=42)

    # Save to disk
    dataset_splits["train"].to_parquet("cnn_dailymail_train.parquet")
    dataset_splits["test"].to_parquet("cnn_dailymail_test.parquet")

    print(f"✅ Saved {TRAIN_ROWS} train rows and {TEST_ROWS} test rows.")
else:
    print("Tokenized dataset already exists locally.")
Tokenized dataset does not exist locally... Generating and saving to disk.
Token indices sequence length is longer than the specified maximum sequence length for this model (1194 > 1024). Running this sequence through the model will result in indexing errors
✅ Saved 1400000 train rows and 500 test rows.
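
Before training, it's worth a quick check that the saved rows have the shifted-by-one structure we expect. Below is a minimal sanity check, assuming the parquet files above were written successfully:

# Reload one row and confirm the target sequence is the input shifted left by one token
sample_ds = load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
row = sample_ds[0]
assert row["input"][1:] == row["target"][:-1]
print(hf_tokenizer.decode(row["input"][:20]))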

1.4.2 Modified Training Loop

We have tokenized the dataset in chunks and saved it to disk as parquet files. This is a scalable approach that lets us train the model without ever holding the entire dataset in memory. Let's build a more robust training loop that also saves model checkpoints along the way.

# Example config:
batch_size = 64
sequence_len = 128
num_steps = 150000
accumulation_steps = 100


# Reload the train and test datasets
train_ds = load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
test_ds = load_dataset("parquet", data_files="cnn_dailymail_test.parquet", split="train")

# Convert dataset to PyTorch format
train_ds.set_format("torch", columns=["input", "target"])
test_ds.set_format("torch", columns=["input", "target"])

# Create DataLoaders for training and testing; cycle() lets us draw training
# batches indefinitely without manually restarting the iterator at each epoch
train_dataloader = cycle(DataLoader(train_ds, batch_size=batch_size, shuffle=False))
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

config = GPTConfig(
    vocab_size=hf_tokenizer.vocab_size,
    n_layer=8,   # fewer layers for a quick demo
    n_head=8,
    n_embd=128,
    seq_len=sequence_len,
)

# Create the GPT model
model = GPTModel(config)

use_existing_model = os.path.exists("./pretrain_final.pth")
# Check if pre-trained model exists
if use_existing_model:
    model = torch.load("./pretrain_final.pth", weights_only=False)
    print("Loaded pre-trained model from ./pretrain_final.pth, skipping training loop.")

else:
    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)


    # Define Scheduler
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.3, patience=10, min_lr=5e-6, threshold=1e-4)


    # Training loop
    losses = []
    test_losses = []
    # Running totals used to average the logged training loss over
    # `accumulation_steps` steps (weights are still updated on every step)
    accumulator = 0
    accumulator_loss = 0
    start_time = time.time()
    for i in range(num_steps):
        model.train()
        example = next(train_dataloader)
        train_input = example["input"].to(device)
        train_target = example["target"].to(device)

        logits = model(train_input)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), train_target.view(-1))
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update weights
        optimizer.step()
        optimizer.zero_grad()

        accumulator += 1
        accumulator_loss += loss.item()

        
        if accumulator >= accumulation_steps:
            losses.append(accumulator_loss / accumulation_steps)
            accumulator = 0
            accumulator_loss = 0
            model.eval()
            test_loss = 0
            test_accumulator = 0
            with torch.no_grad():
                for test_example in test_dataloader:
                    test_input = test_example["input"].to(device)
                    test_target = test_example["target"].to(device)
                    test_logits = model(test_input)
                    test_loss += F.cross_entropy(test_logits.view(-1, test_logits.size(-1)), test_target.view(-1)).item()
                    test_accumulator += 1
                test_losses.append(test_loss / test_accumulator)
                elapsed_time = time.time() - start_time
                print(f"Step {i+1}/{num_steps}, Loss: {losses[-1]}, Test Loss: {test_losses[-1]}, LR: {optimizer.param_groups[0]['lr']}, Elapsed Time: {elapsed_time:.2f} seconds")
                # Recreate the test DataLoader so the next evaluation starts from the beginning
                test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
                scheduler.step(test_losses[-1])

   
        if (i+1) % 50000 == 0:
            # Save the model checkpoint
            print(f"Saving model checkpoint at step {i+1}")
            torch.save(model, f"./model_checkpoint_{i+1}.pt")
        
Step 100/150000, Loss: 8.29131314754486, Test Loss: 7.32573276758194, LR: 0.0005, Elapsed Time: 2.37 seconds
Step 200/150000, Loss: 7.117442603111267, Test Loss: 6.908363699913025, LR: 0.0005, Elapsed Time: 4.71 seconds
Step 300/150000, Loss: 6.772330284118652, Test Loss: 6.633603274822235, LR: 0.0005, Elapsed Time: 7.06 seconds
Step 400/150000, Loss: 6.547142271995544, Test Loss: 6.444078981876373, LR: 0.0005, Elapsed Time: 9.42 seconds
Step 500/150000, Loss: 6.379139184951782, Test Loss: 6.298685610294342, LR: 0.0005, Elapsed Time: 11.76 seconds
Step 600/150000, Loss: 6.24203019618988, Test Loss: 6.186869561672211, LR: 0.0005, Elapsed Time: 14.16 seconds
Step 700/150000, Loss: 6.142061977386475, Test Loss: 6.080222308635712, LR: 0.0005, Elapsed Time: 16.54 seconds
Step 800/150000, Loss: 6.046715126037598, Test Loss: 5.995494902133942, LR: 0.0005, Elapsed Time: 18.92 seconds
Step 900/150000, Loss: 5.967564377784729, Test Loss: 5.926055014133453, LR: 0.0005, Elapsed Time: 21.27 seconds
Step 1000/150000, Loss: 5.8898282241821285, Test Loss: 5.863428473472595, LR: 0.0005, Elapsed Time: 23.61 seconds
[... output truncated for readability; only every 5,000th logging step is shown ...]
Step 5000/150000, Loss: 4.968843655586243, Test Loss: 4.9670655727386475, LR: 0.0005, Elapsed Time: 117.85 seconds
Step 10000/150000, Loss: 4.6960454845428465, Test Loss: 4.695659637451172, LR: 0.0005, Elapsed Time: 236.54 seconds
Step 15000/150000, Loss: 4.571347131729126, Test Loss: 4.570954501628876, LR: 0.0005, Elapsed Time: 355.86 seconds
Step 20000/150000, Loss: 4.504937472343445, Test Loss: 4.487073063850403, LR: 0.0005, Elapsed Time: 476.09 seconds
Step 25000/150000, Loss: 4.435029044151306, Test Loss: 4.431743741035461, LR: 0.0005, Elapsed Time: 591.90 seconds
Step 30000/150000, Loss: 4.393869862556458, Test Loss: 4.395161330699921, LR: 0.0005, Elapsed Time: 705.00 seconds
Step 35000/150000, Loss: 4.352422504425049, Test Loss: 4.364431381225586, LR: 0.0005, Elapsed Time: 817.98 seconds
Step 40000/150000, Loss: 4.331869478225708, Test Loss: 4.339958131313324, LR: 0.0005, Elapsed Time: 930.85 seconds
Step 45000/150000, Loss: 4.307983584403992, Test Loss: 4.318207144737244, LR: 0.0005, Elapsed Time: 1043.93 seconds
Step 50000/150000, Loss: 4.286743950843811, Test Loss: 4.300099670886993, LR: 0.0005, Elapsed Time: 1156.95 seconds
Saving model checkpoint at step 50000
...
Step 53000/150000, Loss: 4.288906826972961, Test Loss: 4.2872984409332275, LR: 0.0005, Elapsed Time: 1224.72 seconds
...
Step 55700/150000, Loss: 4.2747852325439455, Test Loss: 4.285087823867798, LR: 0.0005, Elapsed Time: 1285.65 seconds
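At step 55,800 the learning rate drops from 0.0005 to 0.00015, and the test loss, which had been stuck around 4.29-4.30 for several thousand steps, immediately steps down. The same pattern repeats through the rest of the log: each drop multiplies the rate by 0.3 (0.0005 to 0.00015 to 4.5e-05 to 1.35e-05), and the drops arrive at irregular intervals right after the test loss stalls, which points to a plateau-based schedule rather than a fixed step schedule. A minimal sketch of such a schedule using PyTorch's ReduceLROnPlateau follows; the specific scheduler and its hyperparameters are assumptions inferred from the log, not something the output itself confirms.

```python
import torch
import torch.nn as nn

# Stand-in model so the sketch runs on its own; in the notebook this would be
# the GPT model defined earlier.
model = nn.Linear(8, 8)

# The run above starts at lr = 5e-4.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# ReduceLROnPlateau multiplies the learning rate by `factor` whenever the
# monitored metric stops improving for `patience` evaluations, and never goes
# below `min_lr`. factor=0.3 with min_lr=5e-6 would reproduce the sequence in
# this log: 5e-4 -> 1.5e-4 -> 4.5e-5 -> 1.35e-5 -> 5e-6 (one more 0.3x cut
# would give 4.05e-6, so it clamps at the 5e-6 floor).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.3, patience=10, min_lr=5e-6
)

# Inside the training loop, step the scheduler on the held-out loss after
# each evaluation:
#     scheduler.step(test_loss)
```

With the lower rate, the optimizer takes finer steps and the loss resumes its descent: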
Step 55800/150000, Loss: 4.26105140209198, Test Loss: 4.256367921829224, LR: 0.00015, Elapsed Time: 1287.92 seconds
...
Step 60000/150000, Loss: 4.200174951553345, Test Loss: 4.218112945556641, LR: 0.00015, Elapsed Time: 1384.71 seconds
...
Step 63500/150000, Loss: 4.199465398788452, Test Loss: 4.211721777915955, LR: 0.00015, Elapsed Time: 1465.81 seconds
Step 63600/150000, Loss: 4.1937617540359495, Test Loss: 4.205797553062439, LR: 4.4999999999999996e-05, Elapsed Time: 1468.07 seconds
...
Step 69000/150000, Loss: 4.172815842628479, Test Loss: 4.190832018852234, LR: 4.4999999999999996e-05, Elapsed Time: 1590.21 seconds
Step 69100/150000, Loss: 4.17620402097702, Test Loss: 4.189656436443329, LR: 1.3499999999999998e-05, Elapsed Time: 1592.48 seconds
...
Step 72000/150000, Loss: 4.1560625028610225, Test Loss: 4.1870726346969604, LR: 1.3499999999999998e-05, Elapsed Time: 1657.94 seconds
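At step 72,100 the learning rate is cut once more, from 1.35e-05 to 5e-06. Note that a straight 0.3x reduction would give 4.05e-06, so 5e-06 looks like a configured minimum rather than another multiplicative step. By this point the test loss has flattened out near 4.187, and each further reduction buys less than 0.01 of improvement. A useful way to read a cross-entropy value like this (assuming it is the mean cross-entropy in nats, as produced by PyTorch's `cross_entropy`) is to convert it to perplexity, the effective number of tokens the model is hedging between at each prediction:

$$ \text{perplexity} = e^{\text{loss}} \approx e^{4.187} \approx 66 $$

In other words, the pretrained model is, on average, about as uncertain as a uniform choice among roughly 66 tokens: far better than a uniform guess over the whole vocabulary, though well short of what larger models achieve. The run then continues at the floor rate: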
Step 72100/150000, Loss: 4.155453739166259, Test Loss: 4.186811983585358, LR: 5e-06, Elapsed Time: 1660.20 seconds
...
Step 85000/150000, Loss: 4.1517276263237, Test Loss: 4.184073567390442, LR: 5e-06, Elapsed Time: 1951.62 seconds
...
Step 100000/150000, Loss: 4.157420461177826, Test Loss: 4.183087050914764, LR: 5e-06, Elapsed Time: 2290.09 seconds
Saving model checkpoint at step 100000
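The line above shows the second periodic checkpoint (the first was at step 50,000), so the model state is being written out every 50,000 steps. The exact saving code isn't visible in this output; a minimal sketch of this kind of periodic checkpointing in PyTorch, with the file path and the inclusion of the optimizer state as illustrative assumptions, might look like the following.

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Saving state_dicts rather than the whole objects keeps the file small,
    # portable across code changes, and sufficient for resuming training.
    torch.save(
        {
            "step": step,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )

# In the training loop:
#     if step % 50_000 == 0:
#         print(f"Saving model checkpoint at step {step}")
#         save_checkpoint(model, optimizer, step)
#
# To resume from the checkpoint later:
#     ckpt = torch.load("checkpoint.pt")
#     model.load_state_dict(ckpt["model_state_dict"])
#     optimizer.load_state_dict(ckpt["optimizer_state_dict"])
#     step = ckpt["step"]
```

As a side note, the elapsed-time column gives a throughput estimate: the 100 steps between 99,900 and 100,000 took 2290.09 - 2287.84 = 2.25 seconds, about 22.5 ms per step, so the full 150,000-step run projects to roughly 3,400 seconds, just under an hour, on this hardware. Training then continues: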
Step 100100/150000, Loss: 4.166806497573853, Test Loss: 4.1830713748931885, LR: 5e-06, Elapsed Time: 2292.42 seconds
...
Step 104100/150000, Loss: 4.149536352157593, Test Loss: 4.182769298553467, LR: 5e-06, Elapsed Time: 2382.76 seconds
Step 104200/150000, Loss: 4.154743061065674, Test Loss: 4.182685673236847, LR: 5e-06, Elapsed Time: 2385.02 seconds
Step 104300/150000, Loss: 4.153319602012634, Test Loss: 4.18281626701355, LR: 5e-06, Elapsed Time: 2387.29 seconds
Step 104400/150000, Loss: 4.149418239593506, Test Loss: 4.182944059371948, LR: 5e-06, Elapsed Time: 2389.55 seconds
Step 104500/150000, Loss: 4.146622533798218, Test Loss: 4.182862281799316, LR: 5e-06, Elapsed Time: 2391.81 seconds
Step 104600/150000, Loss: 4.140997390747071, Test Loss: 4.1828389167785645, LR: 5e-06, Elapsed Time: 2394.07 seconds
Step 104700/150000, Loss: 4.156354575157166, Test Loss: 4.18295294046402, LR: 5e-06, Elapsed Time: 2396.33 seconds
Step 104800/150000, Loss: 4.156742265224457, Test Loss: 4.182774007320404, LR: 5e-06, Elapsed Time: 2398.59 seconds
Step 104900/150000, Loss: 4.1520990371704105, Test Loss: 4.182737171649933, LR: 5e-06, Elapsed Time: 2400.86 seconds
Step 105000/150000, Loss: 4.1269047069549565, Test Loss: 4.182772874832153, LR: 5e-06, Elapsed Time: 2403.11 seconds
Step 105100/150000, Loss: 4.143224565982819, Test Loss: 4.182854235172272, LR: 5e-06, Elapsed Time: 2405.37 seconds
Step 105200/150000, Loss: 4.142754158973694, Test Loss: 4.1828614473342896, LR: 5e-06, Elapsed Time: 2407.62 seconds
Step 105300/150000, Loss: 4.156931076049805, Test Loss: 4.18276172876358, LR: 5e-06, Elapsed Time: 2409.87 seconds
Step 105400/150000, Loss: 4.1452416324615475, Test Loss: 4.182724714279175, LR: 5e-06, Elapsed Time: 2412.13 seconds
Step 105500/150000, Loss: 4.148775811195374, Test Loss: 4.182616174221039, LR: 5e-06, Elapsed Time: 2414.38 seconds
Step 105600/150000, Loss: 4.147649869918824, Test Loss: 4.182743072509766, LR: 5e-06, Elapsed Time: 2416.64 seconds
Step 105700/150000, Loss: 4.146111102104187, Test Loss: 4.182651400566101, LR: 5e-06, Elapsed Time: 2418.89 seconds
Step 105800/150000, Loss: 4.136842861175537, Test Loss: 4.182799458503723, LR: 5e-06, Elapsed Time: 2421.14 seconds
Step 105900/150000, Loss: 4.139220621585846, Test Loss: 4.18272477388382, LR: 5e-06, Elapsed Time: 2423.40 seconds
Step 106000/150000, Loss: 4.1518226861953735, Test Loss: 4.182726562023163, LR: 5e-06, Elapsed Time: 2425.65 seconds
Step 106100/150000, Loss: 4.150581097602844, Test Loss: 4.182570993900299, LR: 5e-06, Elapsed Time: 2427.89 seconds
Step 106200/150000, Loss: 4.155951471328735, Test Loss: 4.182603359222412, LR: 5e-06, Elapsed Time: 2430.15 seconds
Step 106300/150000, Loss: 4.144221453666687, Test Loss: 4.182539641857147, LR: 5e-06, Elapsed Time: 2432.40 seconds
Step 106400/150000, Loss: 4.146421308517456, Test Loss: 4.182624638080597, LR: 5e-06, Elapsed Time: 2434.65 seconds
Step 106500/150000, Loss: 4.134832427501679, Test Loss: 4.182601988315582, LR: 5e-06, Elapsed Time: 2436.91 seconds
Step 106600/150000, Loss: 4.146899452209473, Test Loss: 4.182632863521576, LR: 5e-06, Elapsed Time: 2439.16 seconds
Step 106700/150000, Loss: 4.138580913543701, Test Loss: 4.182562053203583, LR: 5e-06, Elapsed Time: 2441.41 seconds
Step 106800/150000, Loss: 4.1315593957901005, Test Loss: 4.182645380496979, LR: 5e-06, Elapsed Time: 2443.66 seconds
Step 106900/150000, Loss: 4.144743571281433, Test Loss: 4.182561278343201, LR: 5e-06, Elapsed Time: 2445.91 seconds
Step 107000/150000, Loss: 4.142760043144226, Test Loss: 4.182444155216217, LR: 5e-06, Elapsed Time: 2448.16 seconds
Step 107100/150000, Loss: 4.139929027557373, Test Loss: 4.182643353939056, LR: 5e-06, Elapsed Time: 2450.42 seconds
Step 107200/150000, Loss: 4.132779281139374, Test Loss: 4.182783603668213, LR: 5e-06, Elapsed Time: 2452.68 seconds
Step 107300/150000, Loss: 4.147402863502503, Test Loss: 4.182767808437347, LR: 5e-06, Elapsed Time: 2454.92 seconds
Step 107400/150000, Loss: 4.169334177970886, Test Loss: 4.18264365196228, LR: 5e-06, Elapsed Time: 2457.18 seconds
Step 107500/150000, Loss: 4.168093242645264, Test Loss: 4.182714760303497, LR: 5e-06, Elapsed Time: 2459.43 seconds
Step 107600/150000, Loss: 4.153196883201599, Test Loss: 4.182579815387726, LR: 5e-06, Elapsed Time: 2461.69 seconds
Step 107700/150000, Loss: 4.15484397649765, Test Loss: 4.18260645866394, LR: 5e-06, Elapsed Time: 2463.94 seconds
Step 107800/150000, Loss: 4.148784837722778, Test Loss: 4.182688415050507, LR: 5e-06, Elapsed Time: 2466.20 seconds
Step 107900/150000, Loss: 4.161658406257629, Test Loss: 4.18259996175766, LR: 5e-06, Elapsed Time: 2468.45 seconds
Step 108000/150000, Loss: 4.164555840492248, Test Loss: 4.182552754878998, LR: 5e-06, Elapsed Time: 2470.71 seconds
Step 108100/150000, Loss: 4.149423875808716, Test Loss: 4.182555258274078, LR: 5e-06, Elapsed Time: 2472.96 seconds
Step 108200/150000, Loss: 4.1666610193252565, Test Loss: 4.18252694606781, LR: 5e-06, Elapsed Time: 2475.21 seconds
Step 108300/150000, Loss: 4.146856908798218, Test Loss: 4.182646632194519, LR: 5e-06, Elapsed Time: 2477.47 seconds
Step 108400/150000, Loss: 4.147478275299072, Test Loss: 4.182528555393219, LR: 5e-06, Elapsed Time: 2479.73 seconds
Step 108500/150000, Loss: 4.169182720184327, Test Loss: 4.1823766231536865, LR: 5e-06, Elapsed Time: 2481.97 seconds
Step 108600/150000, Loss: 4.164214930534363, Test Loss: 4.1825501918792725, LR: 5e-06, Elapsed Time: 2484.22 seconds
Step 108700/150000, Loss: 4.165402431488037, Test Loss: 4.1824668645858765, LR: 5e-06, Elapsed Time: 2486.47 seconds
Step 108800/150000, Loss: 4.164918065071106, Test Loss: 4.182389497756958, LR: 5e-06, Elapsed Time: 2488.71 seconds
Step 108900/150000, Loss: 4.158822455406189, Test Loss: 4.182377517223358, LR: 5e-06, Elapsed Time: 2490.97 seconds
Step 109000/150000, Loss: 4.175883777141571, Test Loss: 4.1824010014534, LR: 5e-06, Elapsed Time: 2493.22 seconds
Step 109100/150000, Loss: 4.151304926872253, Test Loss: 4.182471811771393, LR: 5e-06, Elapsed Time: 2495.47 seconds
Step 109200/150000, Loss: 4.163420929908752, Test Loss: 4.182552933692932, LR: 5e-06, Elapsed Time: 2497.72 seconds
Step 109300/150000, Loss: 4.145566349029541, Test Loss: 4.182755529880524, LR: 5e-06, Elapsed Time: 2499.97 seconds
Step 109400/150000, Loss: 4.159655032157898, Test Loss: 4.182538211345673, LR: 5e-06, Elapsed Time: 2502.23 seconds
Step 109500/150000, Loss: 4.151295309066772, Test Loss: 4.182413578033447, LR: 5e-06, Elapsed Time: 2504.48 seconds
Step 109600/150000, Loss: 4.151892287731171, Test Loss: 4.1823039054870605, LR: 5e-06, Elapsed Time: 2506.74 seconds
Step 109700/150000, Loss: 4.15936960697174, Test Loss: 4.182279109954834, LR: 5e-06, Elapsed Time: 2509.00 seconds
Step 109800/150000, Loss: 4.156202569007873, Test Loss: 4.182223737239838, LR: 5e-06, Elapsed Time: 2511.26 seconds
Step 109900/150000, Loss: 4.161330714225769, Test Loss: 4.182231307029724, LR: 5e-06, Elapsed Time: 2513.51 seconds
Step 110000/150000, Loss: 4.152190790176392, Test Loss: 4.182163655757904, LR: 5e-06, Elapsed Time: 2515.77 seconds
Step 110100/150000, Loss: 4.1610349702835085, Test Loss: 4.18232935667038, LR: 5e-06, Elapsed Time: 2518.02 seconds
Step 110200/150000, Loss: 4.15615481376648, Test Loss: 4.182272791862488, LR: 5e-06, Elapsed Time: 2520.28 seconds
Step 110300/150000, Loss: 4.146832919120788, Test Loss: 4.1823113560676575, LR: 5e-06, Elapsed Time: 2522.54 seconds
Step 110400/150000, Loss: 4.135546026229858, Test Loss: 4.182499408721924, LR: 5e-06, Elapsed Time: 2524.79 seconds
Step 110500/150000, Loss: 4.155034236907959, Test Loss: 4.182358622550964, LR: 5e-06, Elapsed Time: 2527.05 seconds
Step 110600/150000, Loss: 4.1434040546417235, Test Loss: 4.182300448417664, LR: 5e-06, Elapsed Time: 2529.31 seconds
Step 110700/150000, Loss: 4.147428193092346, Test Loss: 4.182361483573914, LR: 5e-06, Elapsed Time: 2531.58 seconds
Step 110800/150000, Loss: 4.157831814289093, Test Loss: 4.182270884513855, LR: 5e-06, Elapsed Time: 2533.83 seconds
Step 110900/150000, Loss: 4.157587933540344, Test Loss: 4.182239592075348, LR: 5e-06, Elapsed Time: 2536.08 seconds
Step 111000/150000, Loss: 4.148485136032105, Test Loss: 4.182348310947418, LR: 5e-06, Elapsed Time: 2538.34 seconds
Step 111100/150000, Loss: 4.157586669921875, Test Loss: 4.182294547557831, LR: 5e-06, Elapsed Time: 2540.59 seconds
Step 111200/150000, Loss: 4.15070599079132, Test Loss: 4.182343780994415, LR: 5e-06, Elapsed Time: 2542.85 seconds
Step 111300/150000, Loss: 4.1476256465911865, Test Loss: 4.1823365688323975, LR: 5e-06, Elapsed Time: 2545.10 seconds
Step 111400/150000, Loss: 4.1499711656570435, Test Loss: 4.182194888591766, LR: 5e-06, Elapsed Time: 2547.37 seconds
Step 111500/150000, Loss: 4.156613259315491, Test Loss: 4.182305991649628, LR: 5e-06, Elapsed Time: 2549.62 seconds
Step 111600/150000, Loss: 4.1429481291770935, Test Loss: 4.182067334651947, LR: 5e-06, Elapsed Time: 2551.87 seconds
Step 111700/150000, Loss: 4.155889172554016, Test Loss: 4.1821237206459045, LR: 5e-06, Elapsed Time: 2554.13 seconds
Step 111800/150000, Loss: 4.157183351516724, Test Loss: 4.18218606710434, LR: 5e-06, Elapsed Time: 2556.39 seconds
Step 111900/150000, Loss: 4.146600050926208, Test Loss: 4.182138741016388, LR: 5e-06, Elapsed Time: 2558.65 seconds
Step 112000/150000, Loss: 4.146382441520691, Test Loss: 4.1821916699409485, LR: 5e-06, Elapsed Time: 2560.91 seconds
Step 112100/150000, Loss: 4.144274435043335, Test Loss: 4.182193160057068, LR: 5e-06, Elapsed Time: 2563.15 seconds
Step 112200/150000, Loss: 4.143755121231079, Test Loss: 4.182167291641235, LR: 5e-06, Elapsed Time: 2565.41 seconds
Step 112300/150000, Loss: 4.14727258682251, Test Loss: 4.18217271566391, LR: 5e-06, Elapsed Time: 2567.66 seconds
Step 112400/150000, Loss: 4.13538088798523, Test Loss: 4.182157278060913, LR: 5e-06, Elapsed Time: 2569.92 seconds
Step 112500/150000, Loss: 4.147740640640259, Test Loss: 4.182156264781952, LR: 5e-06, Elapsed Time: 2572.17 seconds
Step 112600/150000, Loss: 4.15642023563385, Test Loss: 4.182025969028473, LR: 5e-06, Elapsed Time: 2574.42 seconds
Step 112700/150000, Loss: 4.150073022842407, Test Loss: 4.18211567401886, LR: 5e-06, Elapsed Time: 2576.69 seconds
Step 112800/150000, Loss: 4.15554172039032, Test Loss: 4.182019531726837, LR: 5e-06, Elapsed Time: 2578.95 seconds
Step 112900/150000, Loss: 4.165143713951111, Test Loss: 4.1819270849227905, LR: 5e-06, Elapsed Time: 2581.20 seconds
Step 113000/150000, Loss: 4.157542719841003, Test Loss: 4.1819576025009155, LR: 5e-06, Elapsed Time: 2583.47 seconds
Step 113100/150000, Loss: 4.15804057598114, Test Loss: 4.181941449642181, LR: 5e-06, Elapsed Time: 2585.73 seconds
Step 113200/150000, Loss: 4.157160432338714, Test Loss: 4.181927680969238, LR: 5e-06, Elapsed Time: 2587.98 seconds
Step 113300/150000, Loss: 4.149738156795502, Test Loss: 4.182100892066956, LR: 5e-06, Elapsed Time: 2590.23 seconds
Step 113400/150000, Loss: 4.165736474990845, Test Loss: 4.1819047927856445, LR: 5e-06, Elapsed Time: 2592.48 seconds
Step 113500/150000, Loss: 4.151355280876159, Test Loss: 4.182053804397583, LR: 5e-06, Elapsed Time: 2594.74 seconds
Step 113600/150000, Loss: 4.148731408119201, Test Loss: 4.182029902935028, LR: 5e-06, Elapsed Time: 2596.99 seconds
Step 113700/150000, Loss: 4.157613034248352, Test Loss: 4.181879878044128, LR: 5e-06, Elapsed Time: 2599.25 seconds
Step 113800/150000, Loss: 4.148982124328613, Test Loss: 4.1819621324539185, LR: 5e-06, Elapsed Time: 2601.51 seconds
Step 113900/150000, Loss: 4.151141030788422, Test Loss: 4.1820067167282104, LR: 5e-06, Elapsed Time: 2603.76 seconds
Step 114000/150000, Loss: 4.161500372886658, Test Loss: 4.182067692279816, LR: 5e-06, Elapsed Time: 2606.02 seconds
Step 114100/150000, Loss: 4.160679459571838, Test Loss: 4.182023227214813, LR: 5e-06, Elapsed Time: 2608.27 seconds
Step 114200/150000, Loss: 4.156543412208557, Test Loss: 4.182026207447052, LR: 5e-06, Elapsed Time: 2610.53 seconds
Step 114300/150000, Loss: 4.159319648742676, Test Loss: 4.181870400905609, LR: 5e-06, Elapsed Time: 2612.78 seconds
Step 114400/150000, Loss: 4.144652523994446, Test Loss: 4.181992948055267, LR: 5e-06, Elapsed Time: 2615.03 seconds
Step 114500/150000, Loss: 4.14619888305664, Test Loss: 4.18203204870224, LR: 5e-06, Elapsed Time: 2617.28 seconds
Step 114600/150000, Loss: 4.155129632949829, Test Loss: 4.18196713924408, LR: 5e-06, Elapsed Time: 2619.52 seconds
Step 114700/150000, Loss: 4.148270916938782, Test Loss: 4.182024896144867, LR: 5e-06, Elapsed Time: 2621.77 seconds
Step 114800/150000, Loss: 4.153865852355957, Test Loss: 4.181897938251495, LR: 5e-06, Elapsed Time: 2624.01 seconds
Step 114900/150000, Loss: 4.147097158432007, Test Loss: 4.181892454624176, LR: 5e-06, Elapsed Time: 2626.26 seconds
Step 115000/150000, Loss: 4.158718729019165, Test Loss: 4.181888163089752, LR: 5e-06, Elapsed Time: 2628.51 seconds
Step 115100/150000, Loss: 4.149253008365631, Test Loss: 4.181963562965393, LR: 5e-06, Elapsed Time: 2630.77 seconds
Step 115200/150000, Loss: 4.134530744552612, Test Loss: 4.181991100311279, LR: 5e-06, Elapsed Time: 2633.02 seconds
Step 115300/150000, Loss: 4.15236962556839, Test Loss: 4.182008862495422, LR: 5e-06, Elapsed Time: 2635.27 seconds
Step 115400/150000, Loss: 4.166981887817383, Test Loss: 4.182031035423279, LR: 5e-06, Elapsed Time: 2637.52 seconds
Step 115500/150000, Loss: 4.145422413349151, Test Loss: 4.182002425193787, LR: 5e-06, Elapsed Time: 2639.78 seconds
Step 115600/150000, Loss: 4.15366828918457, Test Loss: 4.181926906108856, LR: 5e-06, Elapsed Time: 2642.03 seconds
Step 115700/150000, Loss: 4.14850688457489, Test Loss: 4.18221390247345, LR: 5e-06, Elapsed Time: 2644.28 seconds
Step 115800/150000, Loss: 4.1477928280830385, Test Loss: 4.18206250667572, LR: 5e-06, Elapsed Time: 2646.54 seconds
Step 115900/150000, Loss: 4.153482587337494, Test Loss: 4.182090878486633, LR: 5e-06, Elapsed Time: 2648.80 seconds
Step 116000/150000, Loss: 4.161479394435883, Test Loss: 4.18221253156662, LR: 5e-06, Elapsed Time: 2651.06 seconds
Step 116100/150000, Loss: 4.1488297033309935, Test Loss: 4.182152330875397, LR: 5e-06, Elapsed Time: 2653.32 seconds
Step 116200/150000, Loss: 4.1394637894630435, Test Loss: 4.182214915752411, LR: 5e-06, Elapsed Time: 2655.59 seconds
Step 116300/150000, Loss: 4.14739560842514, Test Loss: 4.1819727420806885, LR: 5e-06, Elapsed Time: 2657.85 seconds
Step 116400/150000, Loss: 4.149780783653259, Test Loss: 4.182024717330933, LR: 5e-06, Elapsed Time: 2660.11 seconds
Step 116500/150000, Loss: 4.140627384185791, Test Loss: 4.181994557380676, LR: 5e-06, Elapsed Time: 2662.37 seconds
Step 116600/150000, Loss: 4.151362566947937, Test Loss: 4.18192458152771, LR: 5e-06, Elapsed Time: 2664.62 seconds
Step 116700/150000, Loss: 4.1385877990722655, Test Loss: 4.181865811347961, LR: 5e-06, Elapsed Time: 2666.87 seconds
Step 116800/150000, Loss: 4.158994064331055, Test Loss: 4.1818459033966064, LR: 5e-06, Elapsed Time: 2669.12 seconds
Step 116900/150000, Loss: 4.1364145398139955, Test Loss: 4.181944489479065, LR: 5e-06, Elapsed Time: 2671.38 seconds
Step 117000/150000, Loss: 4.143009347915649, Test Loss: 4.18189001083374, LR: 5e-06, Elapsed Time: 2673.63 seconds
Step 117100/150000, Loss: 4.152676014900208, Test Loss: 4.181855499744415, LR: 5e-06, Elapsed Time: 2675.89 seconds
Step 117200/150000, Loss: 4.132827467918396, Test Loss: 4.181959807872772, LR: 5e-06, Elapsed Time: 2678.15 seconds
Step 117300/150000, Loss: 4.150890145301819, Test Loss: 4.181816339492798, LR: 5e-06, Elapsed Time: 2680.40 seconds
Step 117400/150000, Loss: 4.134679813385009, Test Loss: 4.181865572929382, LR: 5e-06, Elapsed Time: 2682.66 seconds
Step 117500/150000, Loss: 4.147399582862854, Test Loss: 4.181855618953705, LR: 5e-06, Elapsed Time: 2684.92 seconds
Step 117600/150000, Loss: 4.145980162620544, Test Loss: 4.181827127933502, LR: 5e-06, Elapsed Time: 2687.20 seconds
Step 117700/150000, Loss: 4.14761791229248, Test Loss: 4.181865692138672, LR: 5e-06, Elapsed Time: 2689.46 seconds
Step 117800/150000, Loss: 4.142864291667938, Test Loss: 4.181860327720642, LR: 5e-06, Elapsed Time: 2691.72 seconds
Step 117900/150000, Loss: 4.146791317462921, Test Loss: 4.181806921958923, LR: 5e-06, Elapsed Time: 2693.97 seconds
Step 118000/150000, Loss: 4.140257749557495, Test Loss: 4.18182235956192, LR: 5e-06, Elapsed Time: 2696.23 seconds
Step 118100/150000, Loss: 4.13670841217041, Test Loss: 4.1817373633384705, LR: 5e-06, Elapsed Time: 2698.48 seconds
Step 118200/150000, Loss: 4.140655727386474, Test Loss: 4.1819493770599365, LR: 5e-06, Elapsed Time: 2700.73 seconds
Step 118300/150000, Loss: 4.144402906894684, Test Loss: 4.18203604221344, LR: 5e-06, Elapsed Time: 2702.98 seconds
Step 118400/150000, Loss: 4.144405262470245, Test Loss: 4.181841671466827, LR: 5e-06, Elapsed Time: 2705.24 seconds
Step 118500/150000, Loss: 4.140990843772888, Test Loss: 4.1818817257881165, LR: 5e-06, Elapsed Time: 2707.50 seconds
Step 118600/150000, Loss: 4.148238885402679, Test Loss: 4.181893825531006, LR: 5e-06, Elapsed Time: 2709.76 seconds
Step 118700/150000, Loss: 4.144192721843719, Test Loss: 4.181848168373108, LR: 5e-06, Elapsed Time: 2712.01 seconds
Step 118800/150000, Loss: 4.127857682704925, Test Loss: 4.182120144367218, LR: 5e-06, Elapsed Time: 2714.26 seconds
Step 118900/150000, Loss: 4.132006888389587, Test Loss: 4.181890547275543, LR: 5e-06, Elapsed Time: 2716.52 seconds
Step 119000/150000, Loss: 4.140169262886047, Test Loss: 4.181946516036987, LR: 5e-06, Elapsed Time: 2718.77 seconds
Step 119100/150000, Loss: 4.140655879974365, Test Loss: 4.18196314573288, LR: 5e-06, Elapsed Time: 2721.02 seconds
Step 119200/150000, Loss: 4.1339039659500125, Test Loss: 4.181931018829346, LR: 5e-06, Elapsed Time: 2723.28 seconds
Step 119300/150000, Loss: 4.13203373670578, Test Loss: 4.181909620761871, LR: 5e-06, Elapsed Time: 2725.54 seconds
Step 119400/150000, Loss: 4.132776193618774, Test Loss: 4.181890428066254, LR: 5e-06, Elapsed Time: 2727.79 seconds
Step 119500/150000, Loss: 4.14517247915268, Test Loss: 4.181988775730133, LR: 5e-06, Elapsed Time: 2730.04 seconds
Step 119600/150000, Loss: 4.140990190505981, Test Loss: 4.181983828544617, LR: 5e-06, Elapsed Time: 2732.30 seconds
Step 119700/150000, Loss: 4.135424752235412, Test Loss: 4.18206650018692, LR: 5e-06, Elapsed Time: 2734.56 seconds
Step 119800/150000, Loss: 4.145800905227661, Test Loss: 4.182049870491028, LR: 5e-06, Elapsed Time: 2736.82 seconds
Step 119900/150000, Loss: 4.138657898902893, Test Loss: 4.1817967891693115, LR: 5e-06, Elapsed Time: 2739.07 seconds
Step 120000/150000, Loss: 4.137166090011597, Test Loss: 4.181829333305359, LR: 5e-06, Elapsed Time: 2741.33 seconds
Step 120100/150000, Loss: 4.1335843205451965, Test Loss: 4.182016491889954, LR: 5e-06, Elapsed Time: 2743.58 seconds
Step 120200/150000, Loss: 4.140254406929016, Test Loss: 4.1819387674331665, LR: 5e-06, Elapsed Time: 2745.84 seconds
Step 120300/150000, Loss: 4.14901364326477, Test Loss: 4.181840121746063, LR: 5e-06, Elapsed Time: 2748.09 seconds
Step 120400/150000, Loss: 4.133540589809417, Test Loss: 4.181945025920868, LR: 5e-06, Elapsed Time: 2750.35 seconds
Step 120500/150000, Loss: 4.131813671588898, Test Loss: 4.181842803955078, LR: 5e-06, Elapsed Time: 2752.61 seconds
Step 120600/150000, Loss: 4.123791677951813, Test Loss: 4.181964874267578, LR: 5e-06, Elapsed Time: 2754.86 seconds
Step 120700/150000, Loss: 4.135897347927093, Test Loss: 4.1819958090782166, LR: 5e-06, Elapsed Time: 2757.12 seconds
Step 120800/150000, Loss: 4.1303275847435, Test Loss: 4.18209844827652, LR: 5e-06, Elapsed Time: 2759.38 seconds
Step 120900/150000, Loss: 4.135337829589844, Test Loss: 4.182056546211243, LR: 5e-06, Elapsed Time: 2761.65 seconds
Step 121000/150000, Loss: 4.132717368602752, Test Loss: 4.182052135467529, LR: 5e-06, Elapsed Time: 2763.90 seconds
Step 121100/150000, Loss: 4.134334852695465, Test Loss: 4.181936264038086, LR: 5e-06, Elapsed Time: 2766.16 seconds
Step 121200/150000, Loss: 4.123981580734253, Test Loss: 4.182041883468628, LR: 5e-06, Elapsed Time: 2768.41 seconds
Step 121300/150000, Loss: 4.127435364723206, Test Loss: 4.1821372509002686, LR: 5e-06, Elapsed Time: 2770.67 seconds
Step 121400/150000, Loss: 4.159071564674377, Test Loss: 4.181911468505859, LR: 5e-06, Elapsed Time: 2772.93 seconds
Step 121500/150000, Loss: 4.1649861335754395, Test Loss: 4.181731402873993, LR: 5e-06, Elapsed Time: 2775.18 seconds
Step 121600/150000, Loss: 4.158884854316711, Test Loss: 4.181727230548859, LR: 5e-06, Elapsed Time: 2777.44 seconds
Step 121700/150000, Loss: 4.1710513401031495, Test Loss: 4.181761384010315, LR: 5e-06, Elapsed Time: 2779.70 seconds
Step 121800/150000, Loss: 4.153055021762848, Test Loss: 4.1818326115608215, LR: 5e-06, Elapsed Time: 2781.96 seconds
Step 121900/150000, Loss: 4.159375276565552, Test Loss: 4.181648552417755, LR: 5e-06, Elapsed Time: 2784.22 seconds
Step 122000/150000, Loss: 4.163316497802734, Test Loss: 4.18173211812973, LR: 5e-06, Elapsed Time: 2786.47 seconds
Step 122100/150000, Loss: 4.153690824508667, Test Loss: 4.181877255439758, LR: 5e-06, Elapsed Time: 2788.72 seconds
Step 122200/150000, Loss: 4.157295455932617, Test Loss: 4.181889057159424, LR: 5e-06, Elapsed Time: 2790.98 seconds
Step 122300/150000, Loss: 4.159229340553284, Test Loss: 4.181705415248871, LR: 5e-06, Elapsed Time: 2793.23 seconds
Step 122400/150000, Loss: 4.156752336025238, Test Loss: 4.1817368268966675, LR: 5e-06, Elapsed Time: 2795.48 seconds
Step 122500/150000, Loss: 4.147787728309631, Test Loss: 4.181783080101013, LR: 5e-06, Elapsed Time: 2797.74 seconds
Step 122600/150000, Loss: 4.16032422542572, Test Loss: 4.181823253631592, LR: 5e-06, Elapsed Time: 2799.99 seconds
Step 122700/150000, Loss: 4.165674891471863, Test Loss: 4.181883990764618, LR: 5e-06, Elapsed Time: 2802.25 seconds
Step 122800/150000, Loss: 4.153863697052002, Test Loss: 4.18187552690506, LR: 5e-06, Elapsed Time: 2804.50 seconds
Step 122900/150000, Loss: 4.150179584026336, Test Loss: 4.181854486465454, LR: 5e-06, Elapsed Time: 2806.76 seconds
Step 123000/150000, Loss: 4.148250744342804, Test Loss: 4.181765377521515, LR: 5e-06, Elapsed Time: 2809.02 seconds
Step 123100/150000, Loss: 4.157749433517456, Test Loss: 4.181732773780823, LR: 5e-06, Elapsed Time: 2811.29 seconds
Step 123200/150000, Loss: 4.1519507837295535, Test Loss: 4.18162339925766, LR: 5e-06, Elapsed Time: 2813.55 seconds
Step 123300/150000, Loss: 4.156062030792237, Test Loss: 4.181559681892395, LR: 5e-06, Elapsed Time: 2815.81 seconds
Step 123400/150000, Loss: 4.161894068717957, Test Loss: 4.181565284729004, LR: 5e-06, Elapsed Time: 2818.07 seconds
Step 123500/150000, Loss: 4.1506724405288695, Test Loss: 4.181475520133972, LR: 5e-06, Elapsed Time: 2820.34 seconds
Step 123600/150000, Loss: 4.152840025424958, Test Loss: 4.181672990322113, LR: 5e-06, Elapsed Time: 2822.59 seconds
Step 123700/150000, Loss: 4.156029741764069, Test Loss: 4.181765556335449, LR: 5e-06, Elapsed Time: 2824.85 seconds
Step 123800/150000, Loss: 4.1599409294128415, Test Loss: 4.181640803813934, LR: 5e-06, Elapsed Time: 2827.12 seconds
Step 123900/150000, Loss: 4.146530921459198, Test Loss: 4.181649565696716, LR: 5e-06, Elapsed Time: 2829.37 seconds
Step 124000/150000, Loss: 4.152094464302063, Test Loss: 4.181772589683533, LR: 5e-06, Elapsed Time: 2831.64 seconds
Step 124100/150000, Loss: 4.142446751594544, Test Loss: 4.181711852550507, LR: 5e-06, Elapsed Time: 2833.89 seconds
Step 124200/150000, Loss: 4.148900952339172, Test Loss: 4.181776404380798, LR: 5e-06, Elapsed Time: 2836.14 seconds
Step 124300/150000, Loss: 4.156900997161865, Test Loss: 4.1817467212677, LR: 5e-06, Elapsed Time: 2838.40 seconds
Step 124400/150000, Loss: 4.148980157375336, Test Loss: 4.181779265403748, LR: 5e-06, Elapsed Time: 2840.66 seconds
Step 124500/150000, Loss: 4.150394163131714, Test Loss: 4.181774258613586, LR: 5e-06, Elapsed Time: 2842.92 seconds
Step 124600/150000, Loss: 4.154425678253173, Test Loss: 4.181757986545563, LR: 5e-06, Elapsed Time: 2845.17 seconds
Step 124700/150000, Loss: 4.139901387691498, Test Loss: 4.181874454021454, LR: 5e-06, Elapsed Time: 2847.44 seconds
Step 124800/150000, Loss: 4.15435087442398, Test Loss: 4.181739687919617, LR: 5e-06, Elapsed Time: 2849.68 seconds
Step 124900/150000, Loss: 4.156291320323944, Test Loss: 4.1817198395729065, LR: 5e-06, Elapsed Time: 2851.94 seconds
Step 125000/150000, Loss: 4.1428249549865725, Test Loss: 4.181617915630341, LR: 5e-06, Elapsed Time: 2854.19 seconds
Step 125100/150000, Loss: 4.161484522819519, Test Loss: 4.181633830070496, LR: 5e-06, Elapsed Time: 2856.44 seconds
Step 125200/150000, Loss: 4.150545539855957, Test Loss: 4.1815988421440125, LR: 5e-06, Elapsed Time: 2858.69 seconds
Step 125300/150000, Loss: 4.138963198661804, Test Loss: 4.181654214859009, LR: 5e-06, Elapsed Time: 2860.95 seconds
Step 125400/150000, Loss: 4.14000364780426, Test Loss: 4.181680500507355, LR: 5e-06, Elapsed Time: 2863.20 seconds
Step 125500/150000, Loss: 4.157096199989319, Test Loss: 4.181566953659058, LR: 5e-06, Elapsed Time: 2865.45 seconds
Step 125600/150000, Loss: 4.146410193443298, Test Loss: 4.181634366512299, LR: 5e-06, Elapsed Time: 2867.71 seconds
Step 125700/150000, Loss: 4.152270107269287, Test Loss: 4.181476593017578, LR: 5e-06, Elapsed Time: 2869.96 seconds
Step 125800/150000, Loss: 4.151828274726868, Test Loss: 4.181437730789185, LR: 5e-06, Elapsed Time: 2872.22 seconds
Step 125900/150000, Loss: 4.14664783000946, Test Loss: 4.181487798690796, LR: 5e-06, Elapsed Time: 2874.47 seconds
Step 126000/150000, Loss: 4.155303215980529, Test Loss: 4.181423366069794, LR: 5e-06, Elapsed Time: 2876.73 seconds
Step 126100/150000, Loss: 4.149375920295715, Test Loss: 4.1813507080078125, LR: 5e-06, Elapsed Time: 2878.98 seconds
Step 126200/150000, Loss: 4.157417650222778, Test Loss: 4.181538641452789, LR: 5e-06, Elapsed Time: 2881.24 seconds
Step 126300/150000, Loss: 4.14254980802536, Test Loss: 4.181646227836609, LR: 5e-06, Elapsed Time: 2883.50 seconds
Step 126400/150000, Loss: 4.140755379199982, Test Loss: 4.181546211242676, LR: 5e-06, Elapsed Time: 2885.75 seconds
Step 126500/150000, Loss: 4.1530619668960576, Test Loss: 4.1815454959869385, LR: 5e-06, Elapsed Time: 2888.01 seconds
Step 126600/150000, Loss: 4.145195422172546, Test Loss: 4.181601047515869, LR: 5e-06, Elapsed Time: 2890.27 seconds
Step 126700/150000, Loss: 4.158602879047394, Test Loss: 4.181508779525757, LR: 5e-06, Elapsed Time: 2892.53 seconds
Step 126800/150000, Loss: 4.145369207859039, Test Loss: 4.181495845317841, LR: 5e-06, Elapsed Time: 2894.79 seconds
Step 126900/150000, Loss: 4.131147680282592, Test Loss: 4.181473135948181, LR: 5e-06, Elapsed Time: 2897.05 seconds
Step 127000/150000, Loss: 4.136784770488739, Test Loss: 4.181524276733398, LR: 5e-06, Elapsed Time: 2899.31 seconds
Step 127100/150000, Loss: 4.148173480033875, Test Loss: 4.181537389755249, LR: 5e-06, Elapsed Time: 2901.57 seconds
Step 127200/150000, Loss: 4.154543423652649, Test Loss: 4.181467592716217, LR: 5e-06, Elapsed Time: 2903.83 seconds
Step 127300/150000, Loss: 4.141052598953247, Test Loss: 4.181428015232086, LR: 5e-06, Elapsed Time: 2906.09 seconds
Step 127400/150000, Loss: 4.150248193740845, Test Loss: 4.181328594684601, LR: 5e-06, Elapsed Time: 2908.34 seconds
Step 127500/150000, Loss: 4.143479623794556, Test Loss: 4.181455135345459, LR: 5e-06, Elapsed Time: 2910.60 seconds
Step 127600/150000, Loss: 4.142916204929352, Test Loss: 4.1814024448394775, LR: 5e-06, Elapsed Time: 2912.85 seconds
Step 127700/150000, Loss: 4.135766248703003, Test Loss: 4.1814568638801575, LR: 5e-06, Elapsed Time: 2915.11 seconds
Step 127800/150000, Loss: 4.142787203788758, Test Loss: 4.1815338134765625, LR: 5e-06, Elapsed Time: 2917.37 seconds
Step 127900/150000, Loss: 4.153836424350739, Test Loss: 4.181381344795227, LR: 5e-06, Elapsed Time: 2919.62 seconds
Step 128000/150000, Loss: 4.150507664680481, Test Loss: 4.181241989135742, LR: 5e-06, Elapsed Time: 2921.88 seconds
Step 128100/150000, Loss: 4.145661752223969, Test Loss: 4.1812968254089355, LR: 5e-06, Elapsed Time: 2924.14 seconds
Step 128200/150000, Loss: 4.1499343729019165, Test Loss: 4.181242823600769, LR: 5e-06, Elapsed Time: 2926.40 seconds
Step 128300/150000, Loss: 4.14426750421524, Test Loss: 4.1813793778419495, LR: 5e-06, Elapsed Time: 2928.67 seconds
Step 128400/150000, Loss: 4.1366752576828, Test Loss: 4.181293487548828, LR: 5e-06, Elapsed Time: 2930.92 seconds
Step 128500/150000, Loss: 4.140944304466248, Test Loss: 4.181335091590881, LR: 5e-06, Elapsed Time: 2933.19 seconds
Step 128600/150000, Loss: 4.137900934219361, Test Loss: 4.181239545345306, LR: 5e-06, Elapsed Time: 2935.44 seconds
Step 128700/150000, Loss: 4.138459091186523, Test Loss: 4.181259989738464, LR: 5e-06, Elapsed Time: 2937.70 seconds
Step 128800/150000, Loss: 4.142805697917939, Test Loss: 4.18126517534256, LR: 5e-06, Elapsed Time: 2939.96 seconds
Step 128900/150000, Loss: 4.141567413806915, Test Loss: 4.18113374710083, LR: 5e-06, Elapsed Time: 2942.22 seconds
Step 129000/150000, Loss: 4.136330661773681, Test Loss: 4.1813578605651855, LR: 5e-06, Elapsed Time: 2944.47 seconds
Step 129100/150000, Loss: 4.138098182678223, Test Loss: 4.18156772851944, LR: 5e-06, Elapsed Time: 2946.72 seconds
Step 129200/150000, Loss: 4.153723649978637, Test Loss: 4.1814223527908325, LR: 5e-06, Elapsed Time: 2948.98 seconds
Step 129300/150000, Loss: 4.161256446838379, Test Loss: 4.1814181208610535, LR: 5e-06, Elapsed Time: 2951.25 seconds
Step 129400/150000, Loss: 4.165924334526062, Test Loss: 4.181452631950378, LR: 5e-06, Elapsed Time: 2953.51 seconds
Step 129500/150000, Loss: 4.150084707736969, Test Loss: 4.181361019611359, LR: 5e-06, Elapsed Time: 2955.77 seconds
Step 129600/150000, Loss: 4.150240681171417, Test Loss: 4.181359350681305, LR: 5e-06, Elapsed Time: 2958.03 seconds
Step 129700/150000, Loss: 4.157313137054444, Test Loss: 4.181387841701508, LR: 5e-06, Elapsed Time: 2960.29 seconds
Step 129800/150000, Loss: 4.1555680084228515, Test Loss: 4.1813108921051025, LR: 5e-06, Elapsed Time: 2962.55 seconds
Step 129900/150000, Loss: 4.159912166595459, Test Loss: 4.181372821331024, LR: 5e-06, Elapsed Time: 2964.81 seconds
Step 130000/150000, Loss: 4.151067097187042, Test Loss: 4.181275010108948, LR: 5e-06, Elapsed Time: 2967.07 seconds
Step 130100/150000, Loss: 4.160516471862793, Test Loss: 4.181325018405914, LR: 5e-06, Elapsed Time: 2969.33 seconds
Step 130200/150000, Loss: 4.14968774318695, Test Loss: 4.1813576221466064, LR: 5e-06, Elapsed Time: 2971.58 seconds
Step 130300/150000, Loss: 4.150251178741455, Test Loss: 4.181163787841797, LR: 5e-06, Elapsed Time: 2973.85 seconds
Step 130400/150000, Loss: 4.1699363899230955, Test Loss: 4.181101500988007, LR: 5e-06, Elapsed Time: 2976.11 seconds
Step 130500/150000, Loss: 4.157500033378601, Test Loss: 4.18120151758194, LR: 5e-06, Elapsed Time: 2978.36 seconds
Step 130600/150000, Loss: 4.166793093681336, Test Loss: 4.181151330471039, LR: 5e-06, Elapsed Time: 2980.62 seconds
Step 130700/150000, Loss: 4.16063283443451, Test Loss: 4.181126058101654, LR: 5e-06, Elapsed Time: 2982.87 seconds
Step 130800/150000, Loss: 4.161303510665894, Test Loss: 4.181246042251587, LR: 5e-06, Elapsed Time: 2985.13 seconds
Step 130900/150000, Loss: 4.164380402565002, Test Loss: 4.181098163127899, LR: 5e-06, Elapsed Time: 2987.38 seconds
Step 131000/150000, Loss: 4.151716828346252, Test Loss: 4.1813271045684814, LR: 5e-06, Elapsed Time: 2989.63 seconds
Step 131100/150000, Loss: 4.159277296066284, Test Loss: 4.181358873844147, LR: 5e-06, Elapsed Time: 2991.89 seconds
Step 131200/150000, Loss: 4.1548303031921385, Test Loss: 4.181319355964661, LR: 5e-06, Elapsed Time: 2994.15 seconds
Step 131300/150000, Loss: 4.154317810535431, Test Loss: 4.181254863739014, LR: 5e-06, Elapsed Time: 2996.40 seconds
Step 131400/150000, Loss: 4.152444107532501, Test Loss: 4.1810285449028015, LR: 5e-06, Elapsed Time: 2998.65 seconds
Step 131500/150000, Loss: 4.154593987464905, Test Loss: 4.180910587310791, LR: 5e-06, Elapsed Time: 3000.91 seconds
Step 131600/150000, Loss: 4.150109543800354, Test Loss: 4.180918455123901, LR: 5e-06, Elapsed Time: 3003.17 seconds
Step 131700/150000, Loss: 4.156760025024414, Test Loss: 4.180985331535339, LR: 5e-06, Elapsed Time: 3005.42 seconds
Step 131800/150000, Loss: 4.153700911998749, Test Loss: 4.180987775325775, LR: 5e-06, Elapsed Time: 3007.69 seconds
Step 131900/150000, Loss: 4.155892918109894, Test Loss: 4.180764675140381, LR: 5e-06, Elapsed Time: 3009.95 seconds
Step 132000/150000, Loss: 4.156841316223145, Test Loss: 4.181000053882599, LR: 5e-06, Elapsed Time: 3012.20 seconds
Step 132100/150000, Loss: 4.1526202297210695, Test Loss: 4.181062877178192, LR: 5e-06, Elapsed Time: 3014.46 seconds
Step 132200/150000, Loss: 4.1479080867767335, Test Loss: 4.1810256242752075, LR: 5e-06, Elapsed Time: 3016.72 seconds
Step 132300/150000, Loss: 4.132697849273682, Test Loss: 4.181207120418549, LR: 5e-06, Elapsed Time: 3018.97 seconds
Step 132400/150000, Loss: 4.152479288578033, Test Loss: 4.181059300899506, LR: 5e-06, Elapsed Time: 3021.23 seconds
Step 132500/150000, Loss: 4.149134273529053, Test Loss: 4.1810537576675415, LR: 5e-06, Elapsed Time: 3023.48 seconds
Step 132600/150000, Loss: 4.142528357505799, Test Loss: 4.181151211261749, LR: 5e-06, Elapsed Time: 3025.73 seconds
Step 132700/150000, Loss: 4.1589872980117795, Test Loss: 4.180977702140808, LR: 5e-06, Elapsed Time: 3027.99 seconds
Step 132800/150000, Loss: 4.149524872303009, Test Loss: 4.1809868812561035, LR: 5e-06, Elapsed Time: 3030.24 seconds
Step 132900/150000, Loss: 4.154537162780762, Test Loss: 4.18122124671936, LR: 5e-06, Elapsed Time: 3032.50 seconds
Step 133000/150000, Loss: 4.158372926712036, Test Loss: 4.181107938289642, LR: 5e-06, Elapsed Time: 3034.76 seconds
Step 133100/150000, Loss: 4.145253925323487, Test Loss: 4.1810309290885925, LR: 5e-06, Elapsed Time: 3037.02 seconds
Step 133200/150000, Loss: 4.1399498462677, Test Loss: 4.181110739707947, LR: 5e-06, Elapsed Time: 3039.27 seconds
Step 133300/150000, Loss: 4.16040673494339, Test Loss: 4.180877983570099, LR: 5e-06, Elapsed Time: 3041.53 seconds
Step 133400/150000, Loss: 4.14818953037262, Test Loss: 4.1810103058815, LR: 5e-06, Elapsed Time: 3043.78 seconds
Step 133500/150000, Loss: 4.146742291450501, Test Loss: 4.180799186229706, LR: 5e-06, Elapsed Time: 3046.03 seconds
Step 133600/150000, Loss: 4.155557007789612, Test Loss: 4.180810868740082, LR: 5e-06, Elapsed Time: 3048.29 seconds
Step 133700/150000, Loss: 4.150594158172607, Test Loss: 4.180923640727997, LR: 5e-06, Elapsed Time: 3050.54 seconds
Step 133800/150000, Loss: 4.145835716724395, Test Loss: 4.1808470487594604, LR: 5e-06, Elapsed Time: 3052.80 seconds
Step 133900/150000, Loss: 4.144240927696228, Test Loss: 4.180925488471985, LR: 5e-06, Elapsed Time: 3055.05 seconds
Step 134000/150000, Loss: 4.143832082748413, Test Loss: 4.18086576461792, LR: 5e-06, Elapsed Time: 3057.31 seconds
Step 134100/150000, Loss: 4.143670716285706, Test Loss: 4.180878460407257, LR: 5e-06, Elapsed Time: 3059.56 seconds
Step 134200/150000, Loss: 4.140794360637665, Test Loss: 4.180964112281799, LR: 5e-06, Elapsed Time: 3061.81 seconds
Step 134300/150000, Loss: 4.1404292154312134, Test Loss: 4.180882215499878, LR: 5e-06, Elapsed Time: 3064.06 seconds
Step 134400/150000, Loss: 4.148724758625031, Test Loss: 4.180740237236023, LR: 5e-06, Elapsed Time: 3066.32 seconds
Step 134500/150000, Loss: 4.1556789541244505, Test Loss: 4.18079686164856, LR: 5e-06, Elapsed Time: 3068.56 seconds
Step 134600/150000, Loss: 4.149005475044251, Test Loss: 4.180815100669861, LR: 5e-06, Elapsed Time: 3070.82 seconds
Step 134700/150000, Loss: 4.161994409561157, Test Loss: 4.180668950080872, LR: 5e-06, Elapsed Time: 3073.07 seconds
Step 134800/150000, Loss: 4.1566725969314575, Test Loss: 4.18065619468689, LR: 5e-06, Elapsed Time: 3075.33 seconds
Step 134900/150000, Loss: 4.152757797241211, Test Loss: 4.1806018352508545, LR: 5e-06, Elapsed Time: 3077.59 seconds
Step 135000/150000, Loss: 4.155890913009643, Test Loss: 4.1806640625, LR: 5e-06, Elapsed Time: 3079.84 seconds
Step 135100/150000, Loss: 4.1620371055603025, Test Loss: 4.180613458156586, LR: 5e-06, Elapsed Time: 3082.10 seconds
Step 135200/150000, Loss: 4.150667989253998, Test Loss: 4.180708050727844, LR: 5e-06, Elapsed Time: 3084.35 seconds
Step 135300/150000, Loss: 4.158370022773743, Test Loss: 4.180633962154388, LR: 5e-06, Elapsed Time: 3086.61 seconds
Step 135400/150000, Loss: 4.148505845069885, Test Loss: 4.1807741522789, LR: 5e-06, Elapsed Time: 3088.86 seconds
Step 135500/150000, Loss: 4.148550260066986, Test Loss: 4.180688500404358, LR: 5e-06, Elapsed Time: 3091.11 seconds
Step 135600/150000, Loss: 4.1600560235977175, Test Loss: 4.1806416511535645, LR: 5e-06, Elapsed Time: 3093.37 seconds
Step 135700/150000, Loss: 4.142225279808044, Test Loss: 4.180648863315582, LR: 5e-06, Elapsed Time: 3095.62 seconds
Step 135800/150000, Loss: 4.157450475692749, Test Loss: 4.180757164955139, LR: 5e-06, Elapsed Time: 3097.88 seconds
Step 135900/150000, Loss: 4.15175096988678, Test Loss: 4.180720567703247, LR: 5e-06, Elapsed Time: 3100.14 seconds
Step 136000/150000, Loss: 4.155349378585815, Test Loss: 4.180803835391998, LR: 5e-06, Elapsed Time: 3102.40 seconds
Step 136100/150000, Loss: 4.155440516471863, Test Loss: 4.1807039976119995, LR: 5e-06, Elapsed Time: 3104.65 seconds
Step 136200/150000, Loss: 4.157934260368347, Test Loss: 4.18072122335434, LR: 5e-06, Elapsed Time: 3106.91 seconds
Step 136300/150000, Loss: 4.139308223724365, Test Loss: 4.180696070194244, LR: 5e-06, Elapsed Time: 3109.16 seconds
Step 136400/150000, Loss: 4.149413270950317, Test Loss: 4.18073034286499, LR: 5e-06, Elapsed Time: 3111.42 seconds
Step 136500/150000, Loss: 4.149836130142212, Test Loss: 4.180678188800812, LR: 5e-06, Elapsed Time: 3113.68 seconds
Step 136600/150000, Loss: 4.147969222068786, Test Loss: 4.180661857128143, LR: 5e-06, Elapsed Time: 3115.94 seconds
Step 136700/150000, Loss: 4.153535614013672, Test Loss: 4.180656909942627, LR: 5e-06, Elapsed Time: 3118.19 seconds
Step 136800/150000, Loss: 4.149936199188232, Test Loss: 4.180674254894257, LR: 5e-06, Elapsed Time: 3120.45 seconds
Step 136900/150000, Loss: 4.145710363388061, Test Loss: 4.180688798427582, LR: 5e-06, Elapsed Time: 3122.70 seconds
Step 137000/150000, Loss: 4.1526238322258, Test Loss: 4.1806952357292175, LR: 5e-06, Elapsed Time: 3124.96 seconds
Step 137100/150000, Loss: 4.140164530277252, Test Loss: 4.180743753910065, LR: 5e-06, Elapsed Time: 3127.22 seconds
Step 137200/150000, Loss: 4.154197213649749, Test Loss: 4.18074768781662, LR: 5e-06, Elapsed Time: 3129.47 seconds
Step 137300/150000, Loss: 4.1588286519050595, Test Loss: 4.1807573437690735, LR: 5e-06, Elapsed Time: 3131.73 seconds
Step 137400/150000, Loss: 4.146291317939759, Test Loss: 4.180715978145599, LR: 5e-06, Elapsed Time: 3133.98 seconds
Step 137500/150000, Loss: 4.142855167388916, Test Loss: 4.180676221847534, LR: 5e-06, Elapsed Time: 3136.23 seconds
Step 137600/150000, Loss: 4.151567249298096, Test Loss: 4.180994093418121, LR: 5e-06, Elapsed Time: 3138.49 seconds
Step 137700/150000, Loss: 4.148460640907287, Test Loss: 4.180766224861145, LR: 5e-06, Elapsed Time: 3140.74 seconds
Step 137800/150000, Loss: 4.1576125431060795, Test Loss: 4.180857837200165, LR: 5e-06, Elapsed Time: 3143.01 seconds
Step 137900/150000, Loss: 4.148126668930054, Test Loss: 4.180827796459198, LR: 5e-06, Elapsed Time: 3145.26 seconds
Step 138000/150000, Loss: 4.144606010913849, Test Loss: 4.180875599384308, LR: 5e-06, Elapsed Time: 3147.51 seconds
Step 138100/150000, Loss: 4.142617239952087, Test Loss: 4.180846452713013, LR: 5e-06, Elapsed Time: 3149.77 seconds
Step 138200/150000, Loss: 4.145609557628632, Test Loss: 4.180811822414398, LR: 5e-06, Elapsed Time: 3152.03 seconds
Step 138300/150000, Loss: 4.145351595878601, Test Loss: 4.180744528770447, LR: 5e-06, Elapsed Time: 3154.28 seconds
Step 138400/150000, Loss: 4.143880844116211, Test Loss: 4.1807814836502075, LR: 5e-06, Elapsed Time: 3156.54 seconds
Step 138500/150000, Loss: 4.14545716047287, Test Loss: 4.180710971355438, LR: 5e-06, Elapsed Time: 3158.79 seconds
Step 138600/150000, Loss: 4.135269021987915, Test Loss: 4.180614590644836, LR: 5e-06, Elapsed Time: 3161.05 seconds
Step 138700/150000, Loss: 4.156927795410156, Test Loss: 4.180579125881195, LR: 5e-06, Elapsed Time: 3163.30 seconds
Step 138800/150000, Loss: 4.134905803203583, Test Loss: 4.180674970149994, LR: 5e-06, Elapsed Time: 3165.55 seconds
Step 138900/150000, Loss: 4.145070972442627, Test Loss: 4.180669188499451, LR: 5e-06, Elapsed Time: 3167.81 seconds
Step 139000/150000, Loss: 4.148415899276733, Test Loss: 4.18072384595871, LR: 5e-06, Elapsed Time: 3170.07 seconds
Step 139100/150000, Loss: 4.135412149429321, Test Loss: 4.180743217468262, LR: 5e-06, Elapsed Time: 3172.33 seconds
Step 139200/150000, Loss: 4.141506609916687, Test Loss: 4.180526077747345, LR: 5e-06, Elapsed Time: 3174.58 seconds
Step 139300/150000, Loss: 4.136495673656464, Test Loss: 4.180550515651703, LR: 5e-06, Elapsed Time: 3176.84 seconds
Step 139400/150000, Loss: 4.146461436748504, Test Loss: 4.180645108222961, LR: 5e-06, Elapsed Time: 3179.10 seconds
Step 139500/150000, Loss: 4.140581934452057, Test Loss: 4.180588781833649, LR: 5e-06, Elapsed Time: 3181.36 seconds
Step 139600/150000, Loss: 4.151443638801575, Test Loss: 4.1805760860443115, LR: 5e-06, Elapsed Time: 3183.61 seconds
Step 139700/150000, Loss: 4.136056385040283, Test Loss: 4.180652916431427, LR: 5e-06, Elapsed Time: 3185.87 seconds
Step 139800/150000, Loss: 4.146726565361023, Test Loss: 4.180597305297852, LR: 5e-06, Elapsed Time: 3188.13 seconds
Step 139900/150000, Loss: 4.139076569080353, Test Loss: 4.180607259273529, LR: 5e-06, Elapsed Time: 3190.38 seconds
Step 140000/150000, Loss: 4.134179356098175, Test Loss: 4.180575907230377, LR: 5e-06, Elapsed Time: 3192.63 seconds
Step 140100/150000, Loss: 4.138518481254578, Test Loss: 4.180704176425934, LR: 5e-06, Elapsed Time: 3194.88 seconds
Step 140200/150000, Loss: 4.145107164382934, Test Loss: 4.1807790994644165, LR: 5e-06, Elapsed Time: 3197.14 seconds
Step 140300/150000, Loss: 4.1344160962104795, Test Loss: 4.180650591850281, LR: 5e-06, Elapsed Time: 3199.40 seconds
Step 140400/150000, Loss: 4.144437236785889, Test Loss: 4.180719792842865, LR: 5e-06, Elapsed Time: 3201.65 seconds
Step 140500/150000, Loss: 4.147898693084716, Test Loss: 4.180627763271332, LR: 5e-06, Elapsed Time: 3203.90 seconds
Step 140600/150000, Loss: 4.136467070579529, Test Loss: 4.180700957775116, LR: 5e-06, Elapsed Time: 3206.16 seconds
Step 140700/150000, Loss: 4.123574466705322, Test Loss: 4.180872619152069, LR: 5e-06, Elapsed Time: 3208.41 seconds
Step 140800/150000, Loss: 4.1416349649429325, Test Loss: 4.180676102638245, LR: 5e-06, Elapsed Time: 3210.66 seconds
Step 140900/150000, Loss: 4.1361446905136106, Test Loss: 4.18076229095459, LR: 5e-06, Elapsed Time: 3212.92 seconds
Step 141000/150000, Loss: 4.140163555145263, Test Loss: 4.180735766887665, LR: 5e-06, Elapsed Time: 3215.17 seconds
Step 141100/150000, Loss: 4.130803995132446, Test Loss: 4.180750012397766, LR: 5e-06, Elapsed Time: 3217.43 seconds
Step 141200/150000, Loss: 4.129034531116486, Test Loss: 4.180616140365601, LR: 5e-06, Elapsed Time: 3219.68 seconds
Step 141300/150000, Loss: 4.135971484184265, Test Loss: 4.180709600448608, LR: 5e-06, Elapsed Time: 3221.94 seconds
Step 141400/150000, Loss: 4.138126845359802, Test Loss: 4.180687785148621, LR: 5e-06, Elapsed Time: 3224.20 seconds
Step 141500/150000, Loss: 4.147747678756714, Test Loss: 4.1807098388671875, LR: 5e-06, Elapsed Time: 3226.45 seconds
Step 141600/150000, Loss: 4.130188779830933, Test Loss: 4.180795609951019, LR: 5e-06, Elapsed Time: 3228.71 seconds
Step 141700/150000, Loss: 4.1367925262451175, Test Loss: 4.180750787258148, LR: 5e-06, Elapsed Time: 3230.96 seconds
Step 141800/150000, Loss: 4.142098579406738, Test Loss: 4.180681765079498, LR: 5e-06, Elapsed Time: 3233.22 seconds
Step 141900/150000, Loss: 4.136783003807068, Test Loss: 4.180688142776489, LR: 5e-06, Elapsed Time: 3235.47 seconds
Step 142000/150000, Loss: 4.134648907184601, Test Loss: 4.180769979953766, LR: 5e-06, Elapsed Time: 3237.72 seconds
Step 142100/150000, Loss: 4.136258013248444, Test Loss: 4.180747926235199, LR: 5e-06, Elapsed Time: 3239.97 seconds
Step 142200/150000, Loss: 4.148324015140534, Test Loss: 4.180603086948395, LR: 5e-06, Elapsed Time: 3242.22 seconds
Step 142300/150000, Loss: 4.126793384552002, Test Loss: 4.180701911449432, LR: 5e-06, Elapsed Time: 3244.46 seconds
Step 142400/150000, Loss: 4.130730347633362, Test Loss: 4.180586993694305, LR: 5e-06, Elapsed Time: 3246.71 seconds
Step 142500/150000, Loss: 4.1303426551818845, Test Loss: 4.180789530277252, LR: 5e-06, Elapsed Time: 3248.97 seconds
Step 142600/150000, Loss: 4.128604137897492, Test Loss: 4.180763602256775, LR: 5e-06, Elapsed Time: 3251.23 seconds
Step 142700/150000, Loss: 4.130739464759826, Test Loss: 4.180921673774719, LR: 5e-06, Elapsed Time: 3253.48 seconds
Step 142800/150000, Loss: 4.134466123580933, Test Loss: 4.180934131145477, LR: 5e-06, Elapsed Time: 3255.72 seconds
Step 142900/150000, Loss: 4.136259725093842, Test Loss: 4.180843830108643, LR: 5e-06, Elapsed Time: 3257.98 seconds
Step 143000/150000, Loss: 4.1253086376190184, Test Loss: 4.180842876434326, LR: 5e-06, Elapsed Time: 3260.23 seconds
Step 143100/150000, Loss: 4.125257115364075, Test Loss: 4.180930733680725, LR: 5e-06, Elapsed Time: 3262.49 seconds
Step 143200/150000, Loss: 4.130850279331208, Test Loss: 4.180885493755341, LR: 5e-06, Elapsed Time: 3264.74 seconds
Step 143300/150000, Loss: 4.162068078517914, Test Loss: 4.1806520819664, LR: 5e-06, Elapsed Time: 3266.99 seconds
Step 143400/150000, Loss: 4.165802745819092, Test Loss: 4.180557191371918, LR: 5e-06, Elapsed Time: 3269.25 seconds
Step 143500/150000, Loss: 4.161580121517181, Test Loss: 4.180514812469482, LR: 5e-06, Elapsed Time: 3271.51 seconds
Step 143600/150000, Loss: 4.161072676181793, Test Loss: 4.180498957633972, LR: 5e-06, Elapsed Time: 3273.76 seconds
Step 143700/150000, Loss: 4.148600263595581, Test Loss: 4.180596709251404, LR: 5e-06, Elapsed Time: 3276.01 seconds
Step 143800/150000, Loss: 4.159546422958374, Test Loss: 4.180392622947693, LR: 5e-06, Elapsed Time: 3278.26 seconds
Step 143900/150000, Loss: 4.161140251159668, Test Loss: 4.180577218532562, LR: 5e-06, Elapsed Time: 3280.51 seconds
Step 144000/150000, Loss: 4.1581151485443115, Test Loss: 4.180643200874329, LR: 5e-06, Elapsed Time: 3282.77 seconds
Step 144100/150000, Loss: 4.153283135890961, Test Loss: 4.180673182010651, LR: 5e-06, Elapsed Time: 3285.03 seconds
Step 144200/150000, Loss: 4.159139046669006, Test Loss: 4.180613815784454, LR: 5e-06, Elapsed Time: 3287.28 seconds
Step 144300/150000, Loss: 4.149131350517273, Test Loss: 4.1805219650268555, LR: 5e-06, Elapsed Time: 3289.53 seconds
Step 144400/150000, Loss: 4.149352078437805, Test Loss: 4.180687367916107, LR: 5e-06, Elapsed Time: 3291.79 seconds
Step 144500/150000, Loss: 4.156682364940643, Test Loss: 4.180707037448883, LR: 5e-06, Elapsed Time: 3294.04 seconds
Step 144600/150000, Loss: 4.166525187492371, Test Loss: 4.180635154247284, LR: 5e-06, Elapsed Time: 3296.30 seconds
Step 144700/150000, Loss: 4.154554615020752, Test Loss: 4.180738031864166, LR: 5e-06, Elapsed Time: 3298.56 seconds
Step 144800/150000, Loss: 4.14352961063385, Test Loss: 4.180616021156311, LR: 5e-06, Elapsed Time: 3300.82 seconds
Step 144900/150000, Loss: 4.151920824050904, Test Loss: 4.180610001087189, LR: 5e-06, Elapsed Time: 3303.08 seconds
Step 145000/150000, Loss: 4.148179974555969, Test Loss: 4.180620789527893, LR: 5e-06, Elapsed Time: 3305.33 seconds
Step 145100/150000, Loss: 4.155771980285644, Test Loss: 4.18039208650589, LR: 5e-06, Elapsed Time: 3307.58 seconds
Step 145200/150000, Loss: 4.154883942604065, Test Loss: 4.180370271205902, LR: 5e-06, Elapsed Time: 3309.83 seconds
Step 145300/150000, Loss: 4.154076066017151, Test Loss: 4.1803149580955505, LR: 5e-06, Elapsed Time: 3312.08 seconds
Step 145400/150000, Loss: 4.1558993887901305, Test Loss: 4.180402517318726, LR: 5e-06, Elapsed Time: 3314.33 seconds
Step 145500/150000, Loss: 4.1507432389259336, Test Loss: 4.180459439754486, LR: 5e-06, Elapsed Time: 3316.59 seconds
Step 145600/150000, Loss: 4.14922342300415, Test Loss: 4.180556654930115, LR: 5e-06, Elapsed Time: 3318.84 seconds
Step 145700/150000, Loss: 4.164098992347717, Test Loss: 4.180440068244934, LR: 5e-06, Elapsed Time: 3321.10 seconds
Step 145800/150000, Loss: 4.145811583995819, Test Loss: 4.18051677942276, LR: 5e-06, Elapsed Time: 3323.36 seconds
Step 145900/150000, Loss: 4.1386399078369145, Test Loss: 4.180556654930115, LR: 5e-06, Elapsed Time: 3325.61 seconds
Step 146000/150000, Loss: 4.148672757148742, Test Loss: 4.180585861206055, LR: 5e-06, Elapsed Time: 3327.87 seconds
Step 146100/150000, Loss: 4.151382846832275, Test Loss: 4.180614411830902, LR: 5e-06, Elapsed Time: 3330.12 seconds
Step 146200/150000, Loss: 4.1468266272544865, Test Loss: 4.180597722530365, LR: 5e-06, Elapsed Time: 3332.38 seconds
Step 146300/150000, Loss: 4.153343448638916, Test Loss: 4.180604457855225, LR: 5e-06, Elapsed Time: 3334.63 seconds
Step 146400/150000, Loss: 4.1464552354812625, Test Loss: 4.180599570274353, LR: 5e-06, Elapsed Time: 3336.88 seconds
Step 146500/150000, Loss: 4.146523184776306, Test Loss: 4.1806159019470215, LR: 5e-06, Elapsed Time: 3339.13 seconds
Step 146600/150000, Loss: 4.140141241550445, Test Loss: 4.180782318115234, LR: 5e-06, Elapsed Time: 3341.38 seconds
Step 146700/150000, Loss: 4.163375818729401, Test Loss: 4.18051290512085, LR: 5e-06, Elapsed Time: 3343.63 seconds
Step 146800/150000, Loss: 4.148511486053467, Test Loss: 4.180515646934509, LR: 5e-06, Elapsed Time: 3345.89 seconds
Step 146900/150000, Loss: 4.147431492805481, Test Loss: 4.180420696735382, LR: 5e-06, Elapsed Time: 3348.14 seconds
Step 147000/150000, Loss: 4.15636759519577, Test Loss: 4.180519163608551, LR: 5e-06, Elapsed Time: 3350.40 seconds
Step 147100/150000, Loss: 4.148008179664612, Test Loss: 4.180407285690308, LR: 5e-06, Elapsed Time: 3352.65 seconds
Step 147200/150000, Loss: 4.134121441841126, Test Loss: 4.180474638938904, LR: 5e-06, Elapsed Time: 3354.91 seconds
Step 147300/150000, Loss: 4.145106959342956, Test Loss: 4.180422842502594, LR: 5e-06, Elapsed Time: 3357.16 seconds
Step 147400/150000, Loss: 4.1536610579490665, Test Loss: 4.180434823036194, LR: 5e-06, Elapsed Time: 3359.41 seconds
Step 147500/150000, Loss: 4.141767630577087, Test Loss: 4.180455923080444, LR: 5e-06, Elapsed Time: 3361.67 seconds
Step 147600/150000, Loss: 4.157983636856079, Test Loss: 4.180297672748566, LR: 5e-06, Elapsed Time: 3363.92 seconds
Step 147700/150000, Loss: 4.14756781578064, Test Loss: 4.180255055427551, LR: 5e-06, Elapsed Time: 3366.18 seconds
Step 147800/150000, Loss: 4.143155150413513, Test Loss: 4.180320560932159, LR: 5e-06, Elapsed Time: 3368.43 seconds
Step 147900/150000, Loss: 4.153578977584839, Test Loss: 4.180245220661163, LR: 5e-06, Elapsed Time: 3370.69 seconds
Step 148000/150000, Loss: 4.155347971916199, Test Loss: 4.180192351341248, LR: 5e-06, Elapsed Time: 3372.94 seconds
Step 148100/150000, Loss: 4.150769820213318, Test Loss: 4.180369675159454, LR: 5e-06, Elapsed Time: 3375.18 seconds
Step 148200/150000, Loss: 4.141762051582337, Test Loss: 4.1805577874183655, LR: 5e-06, Elapsed Time: 3377.44 seconds
Step 148300/150000, Loss: 4.139307117462158, Test Loss: 4.180364668369293, LR: 5e-06, Elapsed Time: 3379.69 seconds
Step 148400/150000, Loss: 4.149774701595306, Test Loss: 4.180434763431549, LR: 5e-06, Elapsed Time: 3381.95 seconds
Step 148500/150000, Loss: 4.14942238330841, Test Loss: 4.1803507804870605, LR: 5e-06, Elapsed Time: 3384.19 seconds
Step 148600/150000, Loss: 4.160615088939667, Test Loss: 4.180263042449951, LR: 5e-06, Elapsed Time: 3386.44 seconds
Step 148700/150000, Loss: 4.13036456823349, Test Loss: 4.180291593074799, LR: 5e-06, Elapsed Time: 3388.70 seconds
Step 148800/150000, Loss: 4.13461437702179, Test Loss: 4.1803136467933655, LR: 5e-06, Elapsed Time: 3390.96 seconds
Step 148900/150000, Loss: 4.143134479522705, Test Loss: 4.180439054965973, LR: 5e-06, Elapsed Time: 3393.21 seconds
Step 149000/150000, Loss: 4.143254282474518, Test Loss: 4.180271029472351, LR: 5e-06, Elapsed Time: 3395.46 seconds
Step 149100/150000, Loss: 4.1517451691627505, Test Loss: 4.180273711681366, LR: 5e-06, Elapsed Time: 3397.71 seconds
Step 149200/150000, Loss: 4.146310234069825, Test Loss: 4.180260002613068, LR: 5e-06, Elapsed Time: 3399.96 seconds
Step 149300/150000, Loss: 4.1473937559127805, Test Loss: 4.180250108242035, LR: 5e-06, Elapsed Time: 3402.22 seconds
Step 149400/150000, Loss: 4.139902973175049, Test Loss: 4.180222570896149, LR: 5e-06, Elapsed Time: 3404.47 seconds
Step 149500/150000, Loss: 4.14453691482544, Test Loss: 4.180257201194763, LR: 5e-06, Elapsed Time: 3406.73 seconds
Step 149600/150000, Loss: 4.13150342464447, Test Loss: 4.1803149580955505, LR: 5e-06, Elapsed Time: 3408.98 seconds
Step 149700/150000, Loss: 4.142852101325989, Test Loss: 4.180355608463287, LR: 5e-06, Elapsed Time: 3411.24 seconds
Step 149800/150000, Loss: 4.155404534339905, Test Loss: 4.180193781852722, LR: 5e-06, Elapsed Time: 3413.50 seconds
Step 149900/150000, Loss: 4.14969812631607, Test Loss: 4.180100202560425, LR: 5e-06, Elapsed Time: 3415.75 seconds
Step 150000/150000, Loss: 4.143090476989746, Test Loss: 4.1801371574401855, LR: 5e-06, Elapsed Time: 3418.00 seconds
Saving model checkpoint at step 150000
if use_existing_model:
    print("Existing model used, no loss curves shown.")
    plt.imshow(plt.imread("./loss_curve.png"))
else:
    plt.figure(figsize=(10, 6))
    plt.plot(losses, label="Train Loss", color='blue')
    plt.plot(test_losses, label="Test Loss", color='red')
    plt.xlabel('Checkpoint')
    plt.ylabel('Loss')
    plt.title('Training and Test Loss Over Time')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()
Image Output: Training and Test Loss Over Time (pretraining loss curves)
if not use_existing_model:
    torch.save(model, "./pretrain_final.pth")
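As an aside, torch.save(model, ...) pickles the entire module object, which ties the checkpoint to this exact class definition (hence the weights_only=False loads later on). A more portable pattern, shown here only as a sketch, is to save and restore the state_dict; GPTModel below is a placeholder for whatever the model class was named earlier in the notebook:

# Sketch only: parameter-only checkpointing as an alternative to pickling the module.
# `GPTModel` is a placeholder name for the model class defined earlier.
torch.save(model.state_dict(), "./pretrain_final_state.pth")

restored = GPTModel(config)                                   # rebuild the architecture
restored.load_state_dict(torch.load("./pretrain_final_state.pth"))
restored.to(device).eval()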

1.4.3 Inference with Pretrained Model

Now that we have pretrained the model, we can run a few inference examples to see what kinds of outputs it produces. The model outputs legible English, and most of its word choices make sense; however, its small size makes it far less robust than larger models. It is still good enough to show the "sparks" of language understanding.

Since we trained on news articles, I've started the prompts with phrases that could plausibly appear in the news. If you rerun the cell below, you will get different outputs every time, due to the randomness of the next token selection step.

def inference(prompt, torch_model, max_new_tokens):
    torch_model.eval()
    with torch.no_grad():
        tokens = hf_tokenizer.encode(prompt)  # Tokenize the prompt
        for _ in range(max_new_tokens):
            num_tokens = len(tokens)
            # Right-pad to the context window with eos tokens; the causal mask means
            # the padding cannot influence the logits at earlier positions
            tokens_padded = tokens + [hf_tokenizer.eos_token_id] * (config.seq_len - num_tokens)
            tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
            logits = torch_model(tokens_padded)
            # Distribution over the vocabulary at the position of the last real token
            probabilities = torch.softmax(logits[0, num_tokens-1, :], dim=-1)
            predicted_token = torch.multinomial(probabilities, 1).item()  # Sample the next token
            tokens.append(predicted_token)
        return hf_tokenizer.decode(tokens)
print("Predicted:", inference("The president signed a bill to pass", model, max_new_tokens=20))
print("Predicted:", inference("There was a large division in", model, max_new_tokens=20))
print("Predicted:", inference("Reports are showing that", model, max_new_tokens=20))
Predicted: The president signed a bill to pass legislation that would allow for tax breaks if enacted into the law' law as it does not allow the
Predicted: There was a large division in his office, probably for up to 40,000 years ago and has been hospitalized with more than 30
Predicted: Reports are showing that cherry-slipped bazers had effectively volunteered. ‘I think we object about the fact
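
The randomness noted above comes from torch.multinomial sampling directly from the softmax distribution. A common refinement, not used in this notebook but worth a quick sketch, is temperature scaling plus top-k filtering, which trades diversity against coherence. The hypothetical helper below could replace the sampling lines in inference:

# Sketch: temperature + top-k sampling as a drop-in for plain multinomial sampling.
# `logits_at_pos` corresponds to logits[0, num_tokens-1, :] in the inference function above.
def sample_next_token(logits_at_pos, temperature=0.8, top_k=50):
    logits_at_pos = logits_at_pos / temperature        # <1 sharpens the distribution, >1 flattens it
    top_values, top_indices = torch.topk(logits_at_pos, top_k)
    probs = torch.softmax(top_values, dim=-1)          # renormalize over only the top-k logits
    choice = torch.multinomial(probs, 1).item()        # sample an index within the top-k
    return top_indices[choice].item()                  # map back to a vocabulary id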

2: Supervised Fine Tuning

To make the model more usable, we take the pretrained model and put it through a process called supervised fine tuning. This involves training on high quality supervised text datasets to get the model to respond the way we want.

We will use the GammaCorpus Fact-QA dataset from Hugging Face for this. The dataset consists of short question-answer examples, which is good for our use case since we have a small context window of 128 tokens.

Supervised fine tuning is where we can introduce "tags" and other special tokens that help the model understand the different roles in the text. For our dataset, we will use a "question" tag and an "answer" tag. We add these when we create the dataset, and again during inference when a user submits a query. We also add eos tokens to terminate and pad the examples that do not take up the full context window.

After fine tuning on this dataset, ideally we will have an LLM that you can ask a question and get an answer from; one formatted example is sketched below.
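
As a concrete illustration (the tag names match the tokenization code below; the question and answer are made up):

# One formatted SFT example, with illustrative values
question = "What is the capital of France?"
answer = "Paris"
formatted = "<Question>" + question + "</Question>" + "<Answer>" + answer + "</Answer>"
print(formatted)
# <Question>What is the capital of France?</Question><Answer>Paris</Answer>
# The string is then tokenized and right-padded with eos tokens up to the 128-token context window.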

# Load dataset in streaming mode
sft_ds = load_dataset("rubenroy/GammaCorpus-Fact-QA-450k", split="train", streaming=True)
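
If you want to sanity-check the raw data first, you can peek at a few rows; iterating a streaming IterableDataset starts from the beginning on each loop, so this should not consume any rows (the field names below match how the dataset is used later):

# Optional sanity check: print the first few raw question/answer pairs
for i, row in enumerate(sft_ds):
    print(row["question"], "->", row["answer"])
    if i >= 2:
        break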

def check_sft_dataset_exists():
    try:
        # Attempt to load the tokenized splits from the local parquet files
        load_dataset("parquet", data_files="fact_qa_train.parquet", split="train")
        load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")
        return True
    except FileNotFoundError:
        return False
    
if not check_sft_dataset_exists():
    print("Tokenized supervised fine tuning dataset does not exist locally... Generating and saving to disk.")

    def tokenize_and_chunk(dataset, tokenizer, chunk_size=512, rows=1000):
        """
        Tokenizes the dataset into fixed-length `chunk_size`-token examples
        (examples longer than `chunk_size` are skipped; shorter ones are
        right-padded with eos tokens). The 'target' sequence is the input
        shifted left by 1 token. Stops after yielding `rows` tokenized chunks.
        """
        row_count = 0

        for example in dataset:
            question_plus_answer = "<Question>" + example["question"] + "</Question>" + "<Answer>" + example["answer"] + "</Answer>"
            input_tokens = tokenizer(question_plus_answer, truncation=False, padding=False)['input_ids']

            if row_count >= rows:
                return

            if len(input_tokens) >= chunk_size:
                continue  # Skip examples that do not fit in the context window
            else:
                input_tokens = input_tokens + [tokenizer.eos_token_id] * (chunk_size - len(input_tokens))
            
            target_tokens = input_tokens[1:] + [tokenizer.eos_token_id]  # Shifted by 1 token

            yield {
                "input": input_tokens, 
                "target": target_tokens
            }
            
            row_count += 1

    # Set the max number of rows for training and testing
    TRAIN_ROWS = 440000  # Adjust as needed
    TEST_ROWS = 500   # Adjust as needed
    CHUNK_SIZE = 128

    # Convert generator to a Hugging Face Dataset
    tokenized_sft_dataset = Dataset.from_generator(lambda: tokenize_and_chunk(sft_ds, hf_tokenizer, chunk_size=CHUNK_SIZE, rows=TRAIN_ROWS + TEST_ROWS))

    # Split the dataset into `train` and `test`
    sft_dataset_splits = tokenized_sft_dataset.train_test_split(train_size=TRAIN_ROWS, test_size=TEST_ROWS, seed=42)

    # Save to disk
    sft_dataset_splits["train"].to_parquet("fact_qa_train.parquet")
    sft_dataset_splits["test"].to_parquet("fact_qa_test.parquet")

    print(f"✅ Saved {TRAIN_ROWS} train rows and {TEST_ROWS} test rows for supervised fine tuning.")
else:
    print("SFT Tokenized dataset already exists locally.")
Tokenized supervised fine tuning dataset does not exist locally... Generating and saving to disk.
✅ Saved 440000 train rows and 500 test rows for supervised fine tuning.

2.1 Supervised Fine Tuning Training Loop

A training loop very similar to the pretraining loop can be used for supervised fine tuning. The main differences are that we initialize from the pretrained weights, train on the Q&A dataset, and let a ReduceLROnPlateau scheduler cut the learning rate when the test loss stalls.

# Example config:
batch_size = 64
sequence_len = 128
num_steps = 50000
accumulation_steps = 100

# Reload the train and test datasets
train_ds = load_dataset("parquet", data_files="fact_qa_train.parquet", split="train")
test_ds = load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")

# Convert dataset to PyTorch format
train_ds.set_format("torch", columns=["input", "target"])
test_ds.set_format("torch", columns=["input", "target"])

# Create DataLoaders for training and testing
train_dataloader = cycle(DataLoader(train_ds, batch_size=batch_size, shuffle=False))
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)


use_existing_model = os.path.exists("./sft_final.pth")
# Check if pre-trained model exists
if use_existing_model:
    model = torch.load("./sft_final.pth", weights_only=False)
    print("Loaded fine tuned model from ./sft_final.pth, skipping training loop.")

else:
    # For SFT we start with the pretrained model
    model = torch.load("./pretrain_final.pth", weights_only=False)
    
    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)


    # Scheduler with dynamic step size
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, patience=10, min_lr=5e-6, threshold=1e-4)


    # Training loop
    losses = []
    test_losses = []
    accumulator = 0
    accumulator_loss = 0
    for i in range(num_steps):
        model.train()
        example = next(train_dataloader)
        train_input = example["input"].to(device)
        train_target = example["target"].to(device)
        logits = model(train_input)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), train_target.view(-1))
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update weights
        optimizer.step()
        optimizer.zero_grad()

        accumulator += 1
        accumulator_loss += loss.item()  # Accumulated only for averaged logging; weights update every step

        

        if accumulator >= accumulation_steps:
            losses.append(accumulator_loss / accumulation_steps)
            accumulator = 0
            accumulator_loss = 0
            model.eval()
            test_loss = 0
            test_accumulator = 0
            with torch.no_grad():
                for test_example in test_dataloader:
                    test_input = test_example["input"].to(device)
                    test_target = test_example["target"].to(device)
                    test_logits = model(test_input)
                    test_loss += F.cross_entropy(test_logits.view(-1, test_logits.size(-1)), test_target.view(-1)).item()
                    test_accumulator += 1
                test_losses.append(test_loss / test_accumulator)
                print(f"Step {i+1}/{num_steps}, Loss: {losses[-1]}, Test Loss: {test_losses[-1]}")
                test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
                scheduler.step(test_losses[-1])

        if (i + 1) % 50000 == 0:
            torch.save(model.state_dict(), f"./sft_model_checkpoint_{i+1}.pt")
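
One caveat: despite its name, accumulation_steps above only controls how often the running loss is averaged and logged; the optimizer still steps on every batch. True gradient accumulation, which simulates a larger effective batch size, would look roughly like this sketch:

# Sketch: true gradient accumulation (effective batch = batch_size * accumulation_steps)
for i in range(num_steps):
    example = next(train_dataloader)
    logits = model(example["input"].to(device))
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), example["target"].to(device).view(-1))
    (loss / accumulation_steps).backward()   # scale so the summed gradients equal the batch average

    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()                     # apply the accumulated gradients once
        optimizer.zero_grad()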
        
Step 100/50000, Loss: 1.7894065862894057, Test Loss: 0.6899487301707268
Step 200/50000, Loss: 0.670128984451294, Test Loss: 0.6649056524038315
Step 300/50000, Loss: 0.6577387464046478, Test Loss: 0.655867725610733
Step 400/50000, Loss: 0.6484179145097733, Test Loss: 0.6457256749272346
Step 500/50000, Loss: 0.6337686544656753, Test Loss: 0.6401033475995064
Step 600/50000, Loss: 0.6327831310033798, Test Loss: 0.6335209831595421
Step 700/50000, Loss: 0.6319840627908707, Test Loss: 0.6287727132439613
Step 800/50000, Loss: 0.629341772198677, Test Loss: 0.6234856620430946
Step 900/50000, Loss: 0.618602660894394, Test Loss: 0.6195669993758202
Step 1000/50000, Loss: 0.6160968893766403, Test Loss: 0.6167683377861977
... (intermediate log lines omitted; test loss continues to fall gradually and plateaus around 0.50) ...
Step 49100/50000, Loss: 0.4634071630239487, Test Loss: 0.500988133251667
Step 49200/50000, Loss: 0.4706606161594391, Test Loss: 0.5008767060935497
Step 49300/50000, Loss: 0.4656855249404907, Test Loss: 0.500786330550909
Step 49400/50000, Loss: 0.4685150933265686, Test Loss: 0.5007378049194813
Step 49500/50000, Loss: 0.46967413753271103, Test Loss: 0.5007382407784462
Step 49600/50000, Loss: 0.4655856826901436, Test Loss: 0.5007290728390217
Step 49700/50000, Loss: 0.460963761806488, Test Loss: 0.5007485263049603
Step 49800/50000, Loss: 0.468090540766716, Test Loss: 0.5006921850144863
Step 49900/50000, Loss: 0.46244903177022934, Test Loss: 0.5006965585052967
Step 50000/50000, Loss: 0.46912035673856733, Test Loss: 0.500646710395813
if use_existing_model:
    print("Existing model used, no loss curves shown.")
    plt.imshow(plt.imread("./sft_loss_curve.png"))
else:
    plt.figure(figsize=(10, 6))
    plt.plot(losses, label="Train Loss", color='blue')
    plt.plot(test_losses, label="Test Loss", color='red')
    plt.xlabel('Checkpoint')
    plt.ylabel('Loss')
    plt.title('Supervised Fine Tuning - Training and Test Loss Over Time')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()
Image Output: Supervised Fine Tuning - Training and Test Loss Over Time (loss curves)
if not use_existing_model:
    torch.save(model, "./sft_final.pth")

2.2 Inference with Fine Tuned Model

With the fine tuned model, we can perform a more natural form of inference. Instead of formatting all of our prompts as raw next token prediction, we can interact with the model in a natural Q&A style format.

We are using a very small model and a very small dataset compared to modern LLMs, so our model is not going to perform well on most questions. However, it outputs responses that are at least related to the prompt and correctly formatted. It is very cool to see the LLM starting to come together! As we scale up the model, data, etc., the responses become more factual, realistic, and contextually accurate. At this point, the majority of the responses are still hallucinations.

def sft_inference(prompt, torch_model, max_new_tokens):
    torch_model.eval()
    prompt = "<Question>" + prompt + "</Question>" + "<Answer>" # Wrap the prompt in <Question> tags and open the <Answer> tag for the model to complete
    with torch.no_grad():
        tokens = hf_tokenizer.encode(prompt) # Tokenize the prompt
        for _ in range(max_new_tokens):
            if tokens[-1] == hf_tokenizer.eos_token_id: # Stop if the model emits an eos token
                break
            num_tokens = len(tokens)
            tokens_padded = tokens + [hf_tokenizer.eos_token_id] * (config.seq_len - num_tokens) # Pad the sequence with eos tokens
            tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
            logits = torch_model(tokens_padded) # Forward pass through the model
            probabilities = torch.softmax(logits[0, num_tokens-1, :], dim=-1) # Distribution for the next token
            predicted_token = torch.argmax(probabilities).item() # Greedy decoding; switch to sampling for more diversity
            tokens.append(predicted_token)

        # Strip the text to between the <Answer></Answer> tags
        full_answer = hf_tokenizer.decode(tokens)
        answer_start = full_answer.find("<Answer>") + len("<Answer>")
        answer_end = full_answer.find("</Answer>")
        if answer_end == -1: # The model never closed the tag; keep everything after <Answer>
            answer_end = len(full_answer)
        return full_answer[answer_start:answer_end]
print("Predicted:", sft_inference("Who is the most powerful leader in the west?", model, max_new_tokens=20))
print("Predicted:", sft_inference("What color is the sun?", model, max_new_tokens=20))
print("Predicted:", sft_inference("What color is the ocean", model, max_new_tokens=20))
print("Predicted:", sft_inference("How many planets are in the solar system", model, max_new_tokens=20))
print("Predicted:", sft_inference("What three countries are in north america?", model, max_new_tokens=20))
print("Predicted:", sft_inference("How many eyes do humans have?", model, max_new_tokens=20))
Predicted: Theodore Roosevelt
Predicted: Yellow
Predicted: Red
Predicted: About 20,000 planets?
Predicted: United States and Canada
Predicted: Two eyes
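
To quantify this beyond eyeballing, a crude exact-match check against a handful of held-out Q&A pairs works as a first pass. The sketch below re-loads the test parquet split and recovers the question/answer text by decoding the token ids and splitting on the tags; given the hallucination rate, expect a score near zero at this scale:

# Sketch: crude exact-match accuracy over a few held-out examples
eval_ds = load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")
hits, n = 0, 20
for row in eval_ds.select(range(n)):
    text = hf_tokenizer.decode(row["input"])                 # decode the padded token ids
    question = text.split("<Question>")[1].split("</Question>")[0]
    answer = text.split("<Answer>")[1].split("</Answer>")[0]
    prediction = sft_inference(question, model, max_new_tokens=20)
    hits += int(prediction.strip().lower() == answer.strip().lower())
print(f"Exact match: {hits}/{n}")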

Sources