A good resource to use alongside this notebook is the original GPT paper [1], which this notebook largely relies on for the model architecture and implementation.
This article walks through building a simple GPT-style model from scratch using PyTorch [1,2]. The goal is to train a basic large language model from start to finish in one notebook. We will train an LLM that is small enough to fit on a single GPU during training and inference, so the notebook can be run on popular cloud GPU services (Google Colab, Kaggle, Paperspace, etc.). The computation graph of the model that we will build in this article is as follows:
This architecture resembles the original GPT model, and is quite similar to GPT-2 and GPT-3, with the main difference being that it is smaller (fewer decoder blocks and smaller embedding sizes) [1,3,4]. We will zoom into each step of this diagram throughout this article to discuss the math, code, and intuition behind it.
According to the original GPT paper, there are two main training stages for the early GPT models: pretraining and supervised fine-tuning [1]. Pretraining is a self-supervised learning task, where parts of the input data are omitted and used as target variables. Supervised fine-tuning works like a traditional supervised learning task, with human-annotated labels for the input data.
The first stage in building a GPT model is pretraining. Pretraining builds the "base" of an LLM. It allows the model to understand statistical properties of language, grammar, and context.
The goal of pretraining is simple: to have a model that can reliably predict the next token given the previous $k$ tokens in a sequence. The final result of pretraining is a deep learning model that takes in $k$ tokens and produces a discrete probability distribution over what the $(k+1)$-th token should be. We want this distribution to place a high value on the correct token and low values on the incorrect ones.
To achieve this, we start off with a large dataset of raw text. This text can be taken from books, blogs, wikis, research papers, and other text sources. After compiling the large dataset of text, we split the dataset into "chunks" of tokens, where each chunk has a fixed number of tokens (512 for GPT, 1024 for GPT-2, 2048 for GPT-3). This chunk size is known as the "context window". A pretrained model takes in that many tokens and outputs the most likely next token.
When dealing with LLMs, we use the word "token" to describe the smallest "unit" of text that an LLM can analyze [5]. Tokens can generally be thought of as words, conceptually. When analyzing a sequence of text, an LLM first has to convert the text to tokens. This is similar to a dictionary lookup: each word/token has an integer "index" in the lookup, and this index is what actually gets fed into the network to be analyzed.
Each example of the pretraining dataset is a chunk of tokens. The same chunk of tokens is used for the input and the output, but the output is shifted 1 token into the "future". The reason for this has to do with the parallel processing capabilities of the transformer, which we will cover in more depth in the transformer section. The following visual helps show what the training data looks like for pretraining.
Because the model uses transformers and parallel processing, a single example like the one above is actually in a sense 6 different examples. The model is learning the following predictive patterns:
This will be clearer in the transformer section of the article. The main point to take away now is the format of the inputs and outputs of the training data in the pretraining step: the outputs are the inputs, shifted by one token, so that each input token aligns with the output token that comes directly after it in the original sequence.
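As a minimal sketch (with made-up token ids), this is how one chunk becomes an input/target pair by shifting:
chunk = [464, 3290, 11299, 262, 2613, 13] # one chunk of 6 token ids (made-up values)
inputs = chunk[:-1] # [464, 3290, 11299, 262, 2613]
targets = chunk[1:] # [3290, 11299, 262, 2613, 13]
for x, y in zip(inputs, targets):
    print(f"given ...{x} -> predict {y}") # each input token pairs with the token after it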
Before doing a full pretraining loop, we will do a "test run" using a small dataset we can fit into memory. This will allow us to focus on the internals of the model rather than the complexities of data processing. We can use the Salesforce wikitext dataset, which consists of an extract of good and featured Wikipedia articles [6].
We will load the dataset from the Hugging Face datasets hub. The datasets package provides an easy way to load, preprocess, and use a variety of datasets for deep learning [7].
import warnings
import torch
import math
import time
import os
import matplotlib.pyplot as plt
from itertools import cycle
from datasets import Dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from torch.optim.lr_scheduler import _LRScheduler
warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
cuda
from datasets import load_dataset
dataset = load_dataset("EleutherAI/wikitext_document_level", "wikitext-2-raw-v1", split="train")
For pretraining language models, a simple approach to tokenizing and chunking text is as follows:
This process will change slightly when using datasets that are too large to fit into memory.
One easy way to tokenize our dataset is to use tiktoken, OpenAI's implementation of BPE (Byte Pair Encoding) [8]. This article will not go into detail on how a tokenizer is implemented; just know that it converts strings of text into lists of integers, and can also convert the lists of integers back into strings of text.
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2") # Get the same tokenizer used for GPT-2
print("Vocabulary size:", tokenizer.n_vocab) # Vocabilary size is how many unique tokens the tokenizer can encode
print("End of text token:", tokenizer.eot_token) # End of text token is used to indicate the end of a text sequence
print("Example tokenization:", tokenizer.encode("Hello world!"))
# Convert entire dataset into a single string
# This dataset is small enough to fit into memory
# For larger datasets, you may need to use more
# sophisticated methods to process the data.
all_text = ""
all_data = dataset["page"]
for example in all_data:
all_text += "<page> " + example + " </page>"
# Tokenize the entire text at once
tokenized_text = tokenizer.encode(all_text)
# We will create a function that generates a dataset of examples
# for the language model. The function will take in the number of
# examples to generate, the block size, and the test split.
# It will return the training and test datasets.
def get_dataset(num_examples, context_window_length, test_split=0.1):
input_blocks = [] # List to store input sequences
target_blocks = [] # List to store target sequences
# Use a sliding window to create input/target sequences
for i in range(0, len(tokenized_text), context_window_length + 1):
block = tokenized_text[i:i + context_window_length + 1]
# Skip blocks that are too short
if len(block) < context_window_length + 1:
continue
input_seq = block[:-1]
target_seq = block[1:]
input_blocks.append(input_seq)
target_blocks.append(target_seq)
# Stop if we have enough examples
if len(input_blocks) >= num_examples:
break
# Convert to tensors for pytorch and move to gpu
inputs = torch.tensor(input_blocks, dtype=torch.long).to(device)
targets = torch.tensor(target_blocks, dtype=torch.long).to(device)
# Calculate train/test split point
split_idx = int(num_examples * (1 - test_split))
# Split into train/test
train_inputs = inputs[:split_idx]
train_targets = targets[:split_idx]
test_inputs = inputs[split_idx:]
test_targets = targets[split_idx:]
return train_inputs, train_targets, test_inputs, test_targets
# Get a small dataset
i, o, _, _ = get_dataset(2, 4, 0)
print("Input Shape", i.shape)
print("Output Shape", o.shape)
print("Input Example:")
print(i)
print("Output Example:")
print(o)
Vocabulary size: 50257
End of text token: 50256
Example tokenization: [15496, 995, 0]
Input Shape torch.Size([2, 4])
Output Shape torch.Size([2, 4])
Input Example:
tensor([[ 27, 7700, 29, 220],
[ 569, 18354, 7496, 17740]], device='cuda:0')
Output Example:
tensor([[ 7700, 29, 220, 796],
[18354, 7496, 17740, 6711]], device='cuda:0')
Using our tokenizer methods, we have generated a "dummy" dataset that will be used for the rest of the diagrams / examples of the article to show the shapes of the matrices as they flow through the model.
This means that we have a context length of 4 tokens, and a batch size of 2. The full dummy dataset has a total of 2 examples. This is far smaller than the dataset would be in reality - but is useful for introducing the architecture.
Now that we have a small dummy dataset, we can build our LLM model architecture in PyTorch.
First, we build a "config" object that will store the parameters for the network. We will go through each parameter in depth later in the article.
import torch
import torch.nn as nn
import torch.nn.functional as F
# A simple configuration container
class GPTConfig:
def __init__(
self,
vocab_size, # size of the vocabulary, from tokenizer, for gpt2 tokenizer it is 50257
n_layer, # number of transformer blocks
n_head, # number of attention heads for each transformer block
n_embd, # embedding dimension for each token
seq_len, # sequence length for the model - e.g. the "context window"
):
self.vocab_size = vocab_size
self.n_layer = n_layer
self.n_head = n_head
self.n_embd = n_embd
self.seq_len = seq_len
test_config = GPTConfig(
vocab_size=tokenizer.n_vocab,
n_layer=2,
n_head=3,
n_embd=6,
seq_len=4,
)
Our first layer of the network is going to be a token embedding layer. This layer is a bit different from traditional neural network layers. It is essentially a lookup table that returns an "embedding vector" for a given integer index. The goal of this layer is to convert tokens to vectors. These vectors are tuned as the network is trained so that their position in space relative to the other tokens reflects their statistical relationships with each other.
The embedding layer converts a discrete token (integer) into a semantic representation of that token (vector). Before the embedding layer, the model has no idea of what the token means or how it relates to other tokens. After the embedding layer, the model understands the semantic meaning of the token by its relationship with other tokens in the embedding space. For more information on word embeddings see the Word2Vec paper [13].
These are vectors that start off as random, but slowly assume values within embedding space that reflect the semantic meaning of the token. This process happens during training.
For our dummy dataset, the input to this layer will be a matrix of size $2 \times 4$ (batch x token indices). The output will be $2 \times 4 \times 6$ (batch x tokens x embedding dimensions). This transformation can be visualized as follows:
token_embedding = nn.Embedding(test_config.vocab_size, test_config.n_embd).to(device)
test_batch_inputs, _, _, _ = get_dataset(2, test_config.seq_len, 0)
print("Batch shape:", test_batch_inputs.shape, "Batch x Seq Len")
print("After embedding:", token_embedding(test_batch_inputs).shape, "Batch x Seq Len x Embedding Dim")
print("")
print("Before embedding")
print(test_batch_inputs)
print("After embedding")
print(token_embedding(test_batch_inputs))
Batch shape: torch.Size([2, 4]) Batch x Seq Len
After embedding: torch.Size([2, 4, 6]) Batch x Seq Len x Embedding Dim
Before embedding
tensor([[ 27, 7700, 29, 220],
[ 569, 18354, 7496, 17740]], device='cuda:0')
After embedding
tensor([[[ 0.7290, -0.2958, -1.0399, 1.4077, 0.7276, 1.1554],
[-0.5482, -0.3365, -0.1113, 1.3904, 1.6721, -1.5533],
[ 0.0291, -1.3123, 1.4436, 0.7401, 1.1435, 1.1597],
[-0.5509, 1.1057, 1.5446, 0.7508, 0.4335, 1.8201]],
[[ 0.6803, -0.9699, -1.0296, 1.4327, 0.0629, 1.0485],
[ 0.5021, 0.6006, 0.7069, -0.3284, -0.2663, -1.5875],
[ 0.6808, 1.7683, 0.8311, -0.3728, -0.1172, -0.0622],
[ 0.9335, -0.1899, -1.4040, 0.4846, 0.6599, 0.7488]]],
device='cuda:0', grad_fn=<EmbeddingBackward0>)
In this example, we are using an embedding dimension of 6, so each original token is mapped to a vector of length 6. As of right now, these vectors don't have any actual meaning, they are randomly initialized. However, during the training process, these entries will be slowly nudged via backpropagation and over time they will start to assume meaning for their respective tokens.
After embedding the tokens into embedding vectors, we will add a positional encoding to the vectors. Why do we need a positional encoding? Consider the following sentence:
The planet is smaller than the other planet.
A positional encoding allows the model to differentiate the two instances of the word "planet". Without a positional encoding, the token embedding vectors for the two instances of the word "planet" would be exactly the same; with one, the model can tell the two usages apart within the same sentence.
We will use the positional encoding formula that was used in the original transformer paper [9]. The formula works by starting out with a matrix of shape sequence length x embedding dimension. The matrix is then filled in with the following formula:
$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$ $$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
Where $pos$ is the position of the token in the sequence, $i$ is the index of the embedding dimension within the token, and $d$ is the embedding dimension size of the model. The formula outputs a matrix of shape (seq_length x embedding size) that depends only on those two sizes. The matrix starts out as all zeros, and then the formula is applied.
def get_position_encoding(seq_len, d, n=10000):
"""
Computes the positional encoding matrix of shape (seq_len, d).
Args:
seq_len (int): Length of the sequence.
d (int): Dimension of the embedding.
n (float): The base for the exponential term (default 10000 in many Transformer implementations).
Returns:
torch.Tensor: A tensor of shape (seq_len, d) containing the positional encodings.
"""
P = torch.zeros(seq_len, d).to(device)
for pos in range(seq_len):
for i in range(0, d // 2):
P[pos, 2 * i] = math.sin(pos / (n ** ((2 * i) / d)))
if 2 * i + 1 < d: # guard the odd-indexed column (only matters if d is odd)
P[pos, 2 * i + 1] = math.cos(pos / (n ** ((2 * i) / d)))
return P.unsqueeze(0)
# Example usage:
position_encoding = get_position_encoding(seq_len=test_config.seq_len, d=test_config.n_embd)
print("Position encoding shape:", position_encoding.shape)
Position encoding shape: torch.Size([1, 4, 6])
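As a side note, the same matrix can be computed without Python loops. Below is a minimal vectorized sketch (assuming an even embedding dimension, as in our config); the function name get_position_encoding_vectorized is ours, not a standard API:
def get_position_encoding_vectorized(seq_len, d, n=10000):
    pos = torch.arange(seq_len, dtype=torch.float32, device=device).unsqueeze(1) # (seq_len, 1)
    i = torch.arange(d // 2, dtype=torch.float32, device=device) # (d // 2,)
    angles = pos / (n ** (2 * i / d)) # (seq_len, d // 2) via broadcasting
    P = torch.zeros(seq_len, d, device=device)
    P[:, 0::2] = torch.sin(angles) # sine in the even dimensions
    P[:, 1::2] = torch.cos(angles) # cosine in the odd dimensions
    return P.unsqueeze(0)
# Should match the loop version above (up to floating point error):
print(torch.allclose(get_position_encoding_vectorized(4, 6), get_position_encoding(4, 6)))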
Once we have the positional encoding, we add it to the embedding vectors with element-wise addition. Since we are using PyTorch, the addition will "broadcast" across the first dimension. This means that the 4x6 positional encoding matrix is added to each batch example in parallel.
test_embeddings = token_embedding(test_batch_inputs)
test_embeddings_with_pos = test_embeddings + position_encoding
print("Token embeddings shape:", test_embeddings.shape)
print("Position encodings shape:", position_encoding.shape)
print("Sum of token embeddings and position encodings:",test_embeddings_with_pos.shape)
Token embeddings shape: torch.Size([2, 4, 6])
Position encodings shape: torch.Size([1, 4, 6])
Sum of token embeddings and position encodings: torch.Size([2, 4, 6])
At first, it can be challenging to intuit what the positional encoding is doing. The positional encoding is just a constant matrix (given the sequence length and embedding size), with its values set to a desirable pattern. Each row of the matrix aligns to a token position: a constant vector is added to the token at position 1 every time, a different constant vector is added to the token at position 2 every time, and so on.
This differentiates the value of the word "planet" coming at the beginning vs the end of the sentence. However, sometimes relative position of words in a sentence is more important than absolute position. So how do we take that into account? The answer is that the relative relationships between words are emergent. These happen through the process of attention, which we will discuss later.
The key point here is that without positional encoding, these two sentences would look the same:
The positional encoding makes the vectors for dog and owner different in the two sentences, which allows attention to catch onto the relative relationships between these two words.
The below image shows an example of a positional encoding matrix. It looks interesting, but what exactly are we looking at? Why does this help the model encode the position of each embedding vector? Remember, each row in our embedding matrix represents a word/token, and we will be adding this matrix to the embedding matrix to encode positions. One thing to note about this matrix is that each row is unique. There is also a smooth transition between rows: rows 27 and 28 will have very similar patterns, while rows 1 and 120 will differ much more. This smoothness is an important feature that helps the model understand position [10].
There is nothing inherently special about the formula above, there are other formulas for positional encoding. The key thing to note is that there needs to be some matrix that we can add to our embedding matrix that encodes position. This formula has certain properties that are biased towards making it easy for the model to do that.
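One way to see this smoothness for yourself is to compare rows of the encoding matrix directly, using the get_position_encoding function defined above:
P = get_position_encoding(seq_len=128, d=64).squeeze(0) # (128, 64)
print("Distance between rows 27 and 28:", torch.dist(P[27], P[28]).item()) # small: neighbors look alike
print("Distance between rows 1 and 120:", torch.dist(P[1], P[120]).item()) # much larger: distant rows differ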
After positional encoding, we get to the core of the LLM - the (decoder only) transformer. The first step of the transformer is masked multiheaded self attention. We can break down the internals of the transformer into three parts: self attention, then masking, then the multiple heads.
The core idea behind self attention is that it allows every token to "talk" to the other tokens. Attention "reframes" a word's meaning as a combination of all the other words in the context window. A single self attention head does one of many possible "reframings" of each token. It allows the model to understand each word's context in relation to the other words of the sentence.
Self attention starts with just the token embedding matrix with position encodings. It "decomposes" this matrix into queries, keys, and values. In reality all of these are just vectors / matrices that get tuned during training, but we can conceptually think of them as queries, keys, and values due to their dot product operations that take place in the attention operation.
The original equation for scaled dot product attention is as follows [9]: $$Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Q, K, and V are query, key, and value matrices. They are set initially through matrix projections of the input embedding matrix: the token embeddings are multiplied by the $W_q$, $W_k$, and $W_v$ matrices. These weight matrices start off as random and are tuned during the process of training the network. In other words, during training the network learns what "queries" to ask and what "keys" and "values" to set via backpropagation by tuning these matrices. It learns how to transform the embedding matrix into "queries", "keys", and "values" in order to best reduce the loss of the network.
The projection operation to generate Q,K, and V are shown below using the dimensions for our dummy dataset/network.
Q, K, and V are all matrices that are of shape num tokens x embedding size. Each token has a query vector in "query space". Each token also has a key vector in "key space". When we do the $QK^T$ operation, we are calculating how well each token query matches each key. This could be thought of as sort of a "fuzzy lookup" using vector dot products. If the query and key have a high dot product, that means the vectors are pointing in a direction near each other. This also means those two tokens are important to take into account together.
After doing the matrix multiplication between $Q$ and $K^T$, we end up with a similarity matrix of tokens. This similarity matrix tells us how much each token attends to each other token. Each row of the $QK^T$ matrix is put through the softmax function so that each row becomes a probability distribution that adds to one. This probability distribution can be interpreted as how strong of a match each key is to the query of the row, i.e. how much each key "attends" to that query.
The value matrix can be thought of as the actual content/information that each token has to offer. This value matrix is weighted by the similarities of the keys/queries to produce the final output of self attention.
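To make the "fuzzy lookup" idea concrete, here is a tiny sketch with one made-up query vector and three made-up key vectors:
q = torch.tensor([1.0, 0.0]) # a single query vector
K = torch.tensor([
    [0.9, 0.1],  # points roughly the same way as q -> large dot product
    [0.0, 1.0],  # orthogonal to q -> dot product of 0
    [-0.8, 0.2], # points away from q -> negative dot product
])
scores = K @ q # dot product of the query with each key
print(scores) # tensor([ 0.9000,  0.0000, -0.8000])
print(F.softmax(scores, dim=0)) # most of the attention weight goes to the best-matching key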
There are some alternative ways to conceive of the individual operations of attention that can help at a conceptual / intuitive level. Let's go through each operation in attention and try to describe, in plain English, what it is doing.
We know that the $Q$, $K$, $V$ matrices are created by a matrix operation to the input of the transformer (for the first block, this is our position encoded word embeddings). We also know that the weights to create these matrices are tuned through the process of backpropagation. But how can we think of these matrices themselves? What information do they actually contain?
The $Q$ matrix can be thought of as n rows of queries or questions, where n is the number of tokens in the input. When thinking about the $Q$ matrix, think of it as n vectors instead of a single matrix, where each vector is a query or question about the corresponding word that could be answered by some combination of the other words. Remember, we are "reframing" the given word as some combination of the other words. For example, it could look like the following:
In this case each token has a corresponding question. These questions or queries are going to be questions that can be answered by the surrounding tokens. So how are these questions created? $W_q$ is responsible for creating the right questions for each token (with position). $W_q$ maps a token to a relevant query about that token. These queries become relevant through the process of training via backpropagation.
We can think of the $K$ matrix as n row vectors of keys, where n is the number of tokens in the input. What do we mean by "keys"? It is easiest to think of keys as facts that can help answer queries. Above in the query section we asked questions like "what noun do I describe?". A key that might closely match this query would be "I am a noun that can be described". Similar to the queries, $W_k$ creates these keys by learning the right mapping from token to corresponding key. These keys become good matches for the queries because of the $QK^T$ operation that is performed in training.
Overall, each key can be conceived of as a fact about that token that could help answer queries that the other tokens might have.
Now that we have an intuition of the $Q$ and $K$ matrix, we can think about what the matrix multiplication operation $QK^T$ in the attention equation is doing. The $QK^T$ operation is a matching operation, where each query is compared with each key, by performing a dot product operation. If the dot product is large, that means that the key answers or "attends" to the query. If the dot product is small, that means the key is unrelated and does not help answer the query. The $QK^T$ operation "reframes" each query into a set of keys. The resulting matrix of the operation can be thought of as n row vectors. Every dimension or coordinate of these row vectors is a weight for a token key/fact. So a vector in this space is some weighted combination of all of the tokens (keys).
Basically, what we are doing is redescribing the original token query/question as a weighted vector of all of the token keys/answers. Instead of asking a question about a token, we have n different answers, each with its own weight.
When doing the $QK^T$ operation, we are reframing the query row vectors to a combination of the keys. Remember each query has to do with how that token relates to the other tokens, so the answers can be formed as some combination of the other tokens.
This operation is done to make the output of the softmax more stable. The dot product of two random vectors of dimension $d_k$ has a variance that grows proportionally to $d_k$, so the raw scores grow in magnitude as the dimension grows. Dividing by $\sqrt{d_k}$ ensures that no matter how large $d_k$ is, the softmax works as expected and does not produce extreme values.
This is an elementwise division so every element of the matrix is divided by this value. The resulting matrix can be thought of in the same way as the $QK^T$ result, just scaled.
The softmax operation is performed row-wise on the $\frac{QK^T}{\sqrt{d_k}}$ matrix. This means every row results in a probability distribution. We can still think of this as each token is represented as a "reframed" query vector, but now we know that each row vector adds up to one.
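A quick sketch shows why the scaling matters. Dot products of high-dimensional random vectors produce large raw scores, and without the $\sqrt{d_k}$ division the softmax typically collapses toward a one-hot distribution:
torch.manual_seed(0)
d_k = 512
q, k1, k2 = torch.randn(3, d_k) # one random query and two random keys
scores = torch.stack([q @ k1, q @ k2]) # raw dot products; magnitude grows with sqrt(d_k)
print(F.softmax(scores, dim=0)) # typically extreme, close to one-hot
print(F.softmax(scores / math.sqrt(d_k), dim=0)) # scaled scores stay well spread out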
The $V$ matrix is a bit harder to conceive of, but it can be thought of as a matrix whose columns are learned features, where each element is the value of that feature for the token in that row. The rows are "feature" vectors that contain information about specific learned features for each token. When we do the final operation, these feature vectors are weighted, meaning that the features of certain tokens are focused on more than those of other tokens. The $V$ matrix is the actual content or output of attention, and this content is weighted by the result of the $softmax(\frac{QK^T}{\sqrt{d_k}})$ operation.
Now for the final operation of attention, multiplying by the $V$ matrix. We can think of the V matrix as containing the original content of the embeddings. We weight this content based on the query/key matches. In other words, we weight the content based on the specific questions we are trying to ask and how the other words in context answer those questions.
$$softmax(\frac{QK^T}{\sqrt{d_k}})V$$
When putting this all together (using the dimensions of our "test" config object, as in the code), we can see all of the matrix operations and dimensions involved in the self attention operation.
Self attention can be written as a self contained pytorch module as shown below.
class SelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
self.Wq = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Query weights - will transform input embeddings into queries
self.Wk = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Key weights - will transform input embeddings into keys
self.Wv = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Value weights - will transform input embeddings into values
def forward(self, x):
print("Attention input shape:", x.shape)
print("")
print("Query weights shape:", self.Wq.shape)
print("Key weights shape:", self.Wk.shape)
print("Value weights shape:", self.Wv.shape)
queries = x @ self.Wq # Matrix multiplication to transform input embeddings into queries
keys = x @ self.Wk # Matrix multiplication to transform input embeddings into keys
values = x @ self.Wv # Matrix multiplication to transform input embeddings into values
print("")
print("Queries shape:", queries.shape)
print("Keys shape:", keys.shape)
print("Values shape:", values.shape)
qkt = queries @ keys.transpose(-2, -1) # Calculate QK^T
qkt_scaled = qkt / math.sqrt(queries.size(-1)) # Scale QK^T by the dimension of the keys
qkt_softmax = F.softmax(qkt_scaled, dim=-1) # Apply softmax row-wise to get attention weights
print("")
print("QK^T shape:", qkt.shape)
attn_output = qkt_softmax @ values # Multiply softmax(QK^T) by values
print("")
print("Attention output shape:", attn_output.shape)
return attn_output
attention = SelfAttention(test_config)
test_out = attention(test_embeddings_with_pos)
Attention input shape: torch.Size([2, 4, 6])
Query weights shape: torch.Size([6, 6])
Key weights shape: torch.Size([6, 6])
Value weights shape: torch.Size([6, 6])
Queries shape: torch.Size([2, 4, 6])
Keys shape: torch.Size([2, 4, 6])
Values shape: torch.Size([2, 4, 6])
QK^T shape: torch.Size([2, 4, 4])
Attention output shape: torch.Size([2, 4, 6])
Now that we have implemented self attention, we can move on to causal self attention. During training, we are trying to predict the next token at each time step in parallel in the transformer. However, we would be cheating if we allowed attention to see future tokens during training; the model would simply predict the future tokens by looking at them. For this reason we need to mask the matrices so that future tokens are hidden from the self attention layers. We perform this masking after the $QK^T$ operation [11].
The masking process makes the output of the softmax operation 0 in the upper right corner of the matrix. This means the following occurs:
When we say a query is able to be reframed, what we mean mathematically is that the value in that matrix entry can be greater than 0.
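As a small standalone sketch, this is what adding -inf before the softmax does to one row of (made-up) attention scores:
scores = torch.tensor([0.5, 1.2, 0.3, 0.8]) # raw scores for 4 tokens (made-up values)
mask = torch.tensor([0.0, 0.0, float("-inf"), float("-inf")]) # hide the last two (future) tokens
print(F.softmax(scores + mask, dim=0)) # the masked positions get exactly 0 weight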
We can modify our self attention block above to add masking with the following changes:
class CausalSelfAttention(nn.Module):
def __init__(self, config):
super().__init__()
self.Wq = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Query weights - will transform input embeddings into queries
self.Wk = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Key weights - will transform input embeddings into keys
self.Wv = nn.Parameter(torch.randn(config.n_embd, config.n_embd, device=device)) # Value weights - will transform input embeddings into values
def forward(self, x):
seq_len = x.shape[1] # Get sequence length (number of tokens / context window length)
queries = x @ self.Wq # Matrix multiplication to transform input embeddings into queries
keys = x @ self.Wk # Matrix multiplication to transform input embeddings into keys
values = x @ self.Wv # Matrix multiplication to transform input embeddings into values
qkt = queries @ keys.transpose(-2, -1) # Calculate QK^T
qkt_scaled = qkt / math.sqrt(queries.size(-1)) # Scale QK^T by the dimension of the keys
# MASKING
# THIS IS THE ONLY DIFFERENCE, USE -inf FOR UPPER TRIANGLE MASK SO THAT SOFTMAX WILL BE 0
causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1)
causal_mask = causal_mask.masked_fill(causal_mask == 1, float('-inf')) # Upper triangle masked with -inf
qkt_scaled = qkt_scaled + causal_mask # Add the mask to the scaled QK^T
# END MASKING
qkt_softmax = F.softmax(qkt_scaled, dim=-1) # Apply softmax row-wise to get attention weights, the -inf values will become 0 here
attn_output = qkt_softmax @ values # Multiply softmax(QK^T) by values
return attn_output
attention = CausalSelfAttention(test_config)
test_out = attention(test_embeddings_with_pos)
print(test_out.shape) # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])
Now that we have causal self attention, we can add in the "multi-headed" part of the attention layer. Multi-headed attention splits the attention operation across parallel "heads", each with its own learned QKV weights.
What is this actually doing conceptually? It is allowing each head to have the tokens attend to each other in different ways. For instance, one head might focus on grammatical structure, another on semantic meaning, and another on real-world meaning. If viewing the sentence "the sky is blue" from a grammatical structure perspective, the word "the" might attend to the word "sky" heavily because that is what it is referring to. However, if viewing attention through the lens of real-world meaning, the word "the" won't attend to the word "sky" very much because their meanings are not similar. Each word's relationship to the other words might be different depending on what "lens" (or "head") you are viewing them through.
To reiterate, this is a helpful conceptual way to think about multi-headed attention, but the meaning of each head is not always human-understandable in this way. The heads will take on whatever meaning helps minimize the loss function on the training set the most.
The final output of Multi-Headed Causal Self Attention is the same size as the input to the self attention layer.
Below is an outline of all the steps in multi-headed causal self attention. The steps map directly to the PyTorch code in the subsequent segment, and are meant to help visualize what is happening in the full attention operation.
Step 1: Multiply Input by Wqkv
In the above sections when referring to Wq, Wk, and Wv, we referred to them as separate matrices. While that is true and helpful conceptually, we concatenate them into one matrix to make the multi-headed self attention operation more efficient.
The first step is to multiply x by this weight matrix. This is done through a standard PyTorch linear layer. The resulting matrix will be our query, key, and value matrices concatenated.
Step 2: Split the Q, K, V Matrices
Using the split operation in PyTorch, we can split out the Q, K, and V matrices back to individual matrices.
Step 3: Reshape the Q, K, V Matrices Into Heads
Now that we have Q, K, and V Matrices, we can reshape them into heads. This operation should illustrate why in multi-headed self attention, it is required that the embedding dimension be divisible by the number of heads. The image below shows reshaping the Q matrix, but it should also be done for the K and V matrices in the same way.
Step 4: QK^T
Now we can perform the QK^T operation to get the query/key matches. This operation is the same as shown in self attention above, but now we have multiple heads. In our example we have 3 heads. All this means is that we are doing batch matrix multiplication, with the QK^T operation happening for each head in parallel. This means we have different query/key matches for each head.
Step 5: Mask Before Softmax
We take the result and apply the causal mask before softmax operation just like above. The main difference here is that the mask is applied to all 3 heads in parallel.
Step 6: Softmax & Multiply by V
We can then normalize and multiply by V to get the attended values.
Step 7: Merge Heads
We now have "V attended" which has 3 heads. We can merge these back together into a single matrix before sending them through a feedforward layer.
Step 8: Projection Layer
Finally, we feed the attended values through a linear layer, to get the final attention output. This final layer allows information to be combined and mixed between the heads, and projects the shape to match the input shape.
The final attention output can be thought of as the input tokens, but now cross pollinated with information from their interactions with each other.
The following code snippet shows an implementation of multi-headed causal self attention, building on our previous attention blocks. Rather than looping over the heads, it reshapes Q, K, and V so that all heads are computed in one batched matrix multiplication, which works well for the small datasets we are using.
class MultiHeadAttention(nn.Module):
def __init__(self, config):
super().__init__()
assert config.n_embd % config.n_head == 0, "n_embd must be divisible by n_head"
self.n_head = config.n_head
self.n_embd = config.n_embd
self.head_dim = config.n_embd // config.n_head
self.Wqkv = nn.Linear(self.n_embd, 3 * self.n_embd, bias=False).to(device)
self.proj = nn.Linear(self.n_embd, self.n_embd, bias=False).to(device)
# Causal mask to ensure that attention is only applied to previous tokens in the sequence
mask = torch.tril(
torch.ones(config.seq_len, config.seq_len, device=device, dtype=torch.bool)
)
self.register_buffer("causal_mask", mask.view(1, 1, config.seq_len, config.seq_len))
def forward(self, x):
B, seq_len, n_embd = x.shape # (batch, time, channels)
# 1) Multiply input by Wqkv to get queries, keys, values
qkv = self.Wqkv(x) # (B, seq_len, 3n_embd)
# 2) Split the Q, K, V matrices
q, k, v = qkv.split(n_embd, dim=2) # each (B, seq_len, n_embd)
# 3) Reshape the Q, K, V Matrices Into Heads
# (B,T,C) -> (B, seq_len, n_head, head_dim) -> (B, n_head, seq_len, head_dim)
q = q.view(B, seq_len, self.n_head, self.head_dim).transpose(1, 2)
k = k.view(B, seq_len, self.n_head, self.head_dim).transpose(1, 2)
v = v.view(B, seq_len, self.n_head, self.head_dim).transpose(1, 2)
# 4) QK^T
# (B, n_head, seq_len, seq_len)
att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
# 5) Mask Before Softmax
mask = self.causal_mask[:, :, :seq_len, :seq_len]
att = att.masked_fill(~mask, float("-inf"))
# 6) Softmax & Multiply by V
# (B, n_head, seq_len, head_dim)
att = F.softmax(att, dim=-1)
y = att @ v
# 7) Merge heads:
# (B, n_head, seq_len, head_dim) -> (B, seq_len, embedding_dim)
y = y.transpose(1, 2).contiguous().view(B, seq_len, n_embd)
# 8) Projection Layer
y = self.proj(y)
return y
multihead_attn = MultiHeadAttention(test_config)
test_out = multihead_attn(test_embeddings_with_pos)
print(test_out.shape) # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])
We have now successfully implemented multi-headed attention. There are just a few steps left until we have a GPT "block" that we can stack onto the network over and over again. The architecture of a GPT block is as follows:
So far we have built the text embedding, positional encoding, and masked multiheaded self attention parts. Now we need to add in the normalization layers and the feedforward layers. These are straightforward pytorch layers that are common across many neural network architectures.
The layer normalization layers are straightforward and used in many deep learning architectures. Layer normalization normalizes the values of the incoming matrix across the feature dimension (in our case dimension 2). It is used to stabilize training and achieve faster convergence.
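A minimal sketch of what nn.LayerNorm does to our (batch x seq x embedding) tensors, using the test embeddings from earlier:
ln = nn.LayerNorm(test_config.n_embd).to(device)
normed = ln(test_embeddings_with_pos)
print(normed.mean(dim=-1)) # each token's embedding vector now has mean close to 0
print(normed.std(dim=-1)) # ...and standard deviation of roughly 1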
The feedforward layer of the transformer block operates with a different paradigm than attention. While attention captures relationships between tokens, the feedforward layer applies the same transformation to each token in parallel. It can be implemented using standard PyTorch linear layers. We use a hidden size of 4 x the embedding dimension, as was done in the original Attention Is All You Need paper [9], and the Gaussian Error Linear Unit (GELU) activation function, as in the original GPT paper [1].
class GPTBlock(nn.Module):
def __init__(self, config):
super().__init__()
self.mha = MultiHeadAttention(config)
self.ln1 = nn.LayerNorm(config.n_embd).to(device)
self.ffn = nn.Sequential(
nn.Linear(config.n_embd, 4 * config.n_embd),
nn.GELU(),
nn.Linear(4 * config.n_embd, config.n_embd),
).to(device)
self.ln2 = nn.LayerNorm(config.n_embd).to(device)
def forward(self, x):
x = x + self.mha(self.ln1(x))
x = x + self.ffn(self.ln2(x))
return x
block = GPTBlock(test_config)
test_out = block(test_embeddings_with_pos)
print(test_out.shape) # Output should have shape: (batch_size, seq_len, n_embd)
torch.Size([2, 4, 6])
Now that we have a block, we can stack blocks together multiple times to get a GPT-style LLM.
class GPTModel(nn.Module):
def __init__(self, config):
super().__init__()
self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd).to(device)
self.position_encoding = get_position_encoding(config.seq_len, config.n_embd)
self.blocks = nn.Sequential(*[GPTBlock(config) for _ in range(config.n_layer)])
self.ln_f = nn.LayerNorm(config.n_embd).to(device)
self.head = nn.Linear(config.n_embd, config.vocab_size).to(device)
def forward(self, x):
x = self.token_embedding(x) + self.position_encoding
x = self.blocks(x)
x = self.ln_f(x)
return self.head(x)
gpt = GPTModel(test_config)
print(test_batch_inputs.shape)
test_out = gpt(test_batch_inputs)
print(test_out.shape)
torch.Size([2, 4])
torch.Size([2, 4, 50257])
That is a full forward pass through the LLM: the input is of shape $[batch, tokens]$ and the output is of shape $[batch, tokens, probabilities]$. For each token in the input, the LLM predicts a discrete probability distribution over the token that comes directly after it.
The transformer makes these predictions in parallel, one for each token in the input. While all of them are used in training, only the last prediction (for token $n$) is used at inference time to make the final prediction.
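Although the model is still untrained, we can already sketch what inference looks like. A minimal greedy decoding loop (our own helper, not part of the model) crops the context to the window size, takes only the last position's distribution, and appends the argmax token:
def generate(model, tokens, max_new_tokens, seq_len):
    # tokens: (1, T) tensor of token ids, with T >= seq_len
    for _ in range(max_new_tokens):
        context = tokens[:, -seq_len:] # crop to the context window
        logits = model(context) # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] # only the last prediction is used at inference
        next_token = next_logits.argmax(dim=-1, keepdim=True) # greedy: most likely next token
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
out = generate(gpt, test_batch_inputs[:1], max_new_tokens=3, seq_len=test_config.seq_len)
print(tokenizer.decode(out[0].tolist())) # gibberish for now, since the model is untrained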
The following diagram shows the full forward pass with shapes as one example moves through the matrix.
Now that we have gone through the forward pass of the model, we can train it. The model is trained using next token prediction.
According to the original GPT paper, the objective function of pretraining is the following [1]:
$$L_1(U) = \sum_{i}\log P(u_i \mid u_{i-k},\dots,u_{i-1};\theta)$$
Maximizing this objective function is essentially the same as minimizing the cross entropy loss function.
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
This is because during training we use a one-hot encoded vector for the true distribution, so $p(x)$ is 1 for the correct token and 0 for all other tokens. This means we can drop the sum and simplify the cross entropy loss to:
$$H(p, q) = -\log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \theta)$$
Pytorch has a pre-built cross-entropy loss function that can be used as our criterion to minimize [12].
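We can check this equivalence numerically. For a one-hot target, F.cross_entropy reduces to the negative log of the softmax probability assigned to the correct token (made-up logits below):
logits = torch.tensor([[2.0, 0.5, -1.0]]) # one prediction over a tiny 3-token vocabulary
target = torch.tensor([0]) # the correct token is index 0
probs = F.softmax(logits, dim=-1)
print(-torch.log(probs[0, 0])) # manual: -log q(correct token)
print(F.cross_entropy(logits, target)) # the built-in loss gives the same value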
We will first train the model with a small dataset (10 examples) and see if we can get the model to memorize/overfit to the dataset. This is a good test to ensure that our architecture is correct and getting the loss to reduce as expected.
# Example config:
batch_size = 10
sequence_len = 128
num_steps = 1000
train_inputs, train_targets, _, _ = get_dataset(10, sequence_len, 0)
config = GPTConfig(
vocab_size=tokenizer.n_vocab,
n_layer=4, # fewer layers for a quick demo
n_head=4,
n_embd=128,
seq_len=sequence_len,
)
# Create the GPT model
model = GPTModel(config)
# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Define Scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, patience=20, min_lr=5e-6, threshold=1e-4)
# Training loop
i = 1
losses = []
while i < num_steps:
for j in range(0, len(train_inputs), batch_size):
x = train_inputs[j:j+batch_size]
y = train_targets[j:j+batch_size]
# Forward pass
logits = model(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
losses.append(loss.item())
optimizer.step()
optimizer.zero_grad()
loss = loss.item()
scheduler.step(loss)
# Print the average loss for the epoch
lr = optimizer.param_groups[0]["lr"]
print(f"Step {i+1}/{num_steps}, Loss: {loss}, LR: {lr}")
i += 1
Step 2/1000, Loss: 11.028936386108398, LR: 0.0005
Step 3/1000, Loss: 10.818150520324707, LR: 0.0005
Step 4/1000, Loss: 10.620841026306152, LR: 0.0005
Step 5/1000, Loss: 10.440305709838867, LR: 0.0005
Step 6/1000, Loss: 10.277081489562988, LR: 0.0005
...
Step 471/1000, Loss: 0.011273396201431751, LR: 0.0005
Step 472/1000, Loss: 0.011230741627514362, LR: 0.0005
Step 473/1000, Loss: 0.011188351549208164, LR: 0.0005
...
0.011146224103868008, LR: 0.0005 Step 475/1000, Loss: 0.01110434252768755, LR: 0.0005 Step 476/1000, Loss: 0.011062730103731155, LR: 0.0005 Step 477/1000, Loss: 0.011021362617611885, LR: 0.0005 Step 478/1000, Loss: 0.010980254039168358, LR: 0.0005 Step 479/1000, Loss: 0.010939392261207104, LR: 0.0005 Step 480/1000, Loss: 0.010898780077695847, LR: 0.0005 Step 481/1000, Loss: 0.010858409106731415, LR: 0.0005 Step 482/1000, Loss: 0.010818282142281532, LR: 0.0005 Step 483/1000, Loss: 0.010778399184346199, LR: 0.0005 Step 484/1000, Loss: 0.01073874719440937, LR: 0.0005 Step 485/1000, Loss: 0.01069933082908392, LR: 0.0005 Step 486/1000, Loss: 0.010660158470273018, LR: 0.0005 Step 487/1000, Loss: 0.010621210560202599, LR: 0.0005 Step 488/1000, Loss: 0.01058250479400158, LR: 0.0005 Step 489/1000, Loss: 0.010544024407863617, LR: 0.0005 Step 490/1000, Loss: 0.010505775921046734, LR: 0.0005 Step 491/1000, Loss: 0.01046774536371231, LR: 0.0005 Step 492/1000, Loss: 0.01042993925511837, LR: 0.0005 Step 493/1000, Loss: 0.01039235107600689, LR: 0.0005 Step 494/1000, Loss: 0.010354990139603615, LR: 0.0005 Step 495/1000, Loss: 0.010317839682102203, LR: 0.0005 Step 496/1000, Loss: 0.010280909948050976, LR: 0.0005 Step 497/1000, Loss: 0.01024419255554676, LR: 0.0005 Step 498/1000, Loss: 0.01020768377929926, LR: 0.0005 Step 499/1000, Loss: 0.010171398520469666, LR: 0.0005 Step 500/1000, Loss: 0.010135313495993614, LR: 0.0005 Step 501/1000, Loss: 0.010099424049258232, LR: 0.0005 Step 502/1000, Loss: 0.010063758119940758, LR: 0.0005 Step 503/1000, Loss: 0.010028297081589699, LR: 0.0005 Step 504/1000, Loss: 0.00999302975833416, LR: 0.0005 Step 505/1000, Loss: 0.009957965463399887, LR: 0.0005 Step 506/1000, Loss: 0.009923097677528858, LR: 0.0005 Step 507/1000, Loss: 0.009888437576591969, LR: 0.0005 Step 508/1000, Loss: 0.00985395722091198, LR: 0.0005 Step 509/1000, Loss: 0.009819683618843555, LR: 0.0005 Step 510/1000, Loss: 0.00978559534996748, LR: 0.0005 Step 511/1000, Loss: 0.009751707315444946, LR: 0.0005 Step 512/1000, Loss: 0.009718005545437336, LR: 0.0005 Step 513/1000, Loss: 0.009684493765234947, LR: 0.0005 Step 514/1000, Loss: 0.009651164524257183, LR: 0.0005 Step 515/1000, Loss: 0.009618023410439491, LR: 0.0005 Step 516/1000, Loss: 0.009585065767168999, LR: 0.0005 Step 517/1000, Loss: 0.009552285075187683, LR: 0.0005 Step 518/1000, Loss: 0.009519694373011589, LR: 0.0005 Step 519/1000, Loss: 0.009487281553447247, LR: 0.0005 Step 520/1000, Loss: 0.009455042891204357, LR: 0.0005 Step 521/1000, Loss: 0.00942298211157322, LR: 0.0005 Step 522/1000, Loss: 0.009391089901328087, LR: 0.0005 Step 523/1000, Loss: 0.009359384886920452, LR: 0.0005 Step 524/1000, Loss: 0.0093278419226408, LR: 0.0005 Step 525/1000, Loss: 0.009296474978327751, LR: 0.0005 Step 526/1000, Loss: 0.00926528126001358, LR: 0.0005 Step 527/1000, Loss: 0.009234251454472542, LR: 0.0005 Step 528/1000, Loss: 0.009203390218317509, LR: 0.0005 Step 529/1000, Loss: 0.009172694757580757, LR: 0.0005 Step 530/1000, Loss: 0.009142170660197735, LR: 0.0005 Step 531/1000, Loss: 0.00911181140691042, LR: 0.0005 Step 532/1000, Loss: 0.009081612341105938, LR: 0.0005 Step 533/1000, Loss: 0.009051559492945671, LR: 0.0005 Step 534/1000, Loss: 0.009021690115332603, LR: 0.0005 Step 535/1000, Loss: 0.008991964161396027, LR: 0.0005 Step 536/1000, Loss: 0.00896239373832941, LR: 0.0005 Step 537/1000, Loss: 0.008932976052165031, LR: 0.0005 Step 538/1000, Loss: 0.008903736248612404, LR: 0.0005 Step 539/1000, Loss: 0.008874637074768543, LR: 0.0005 Step 540/1000, Loss: 
0.008845696225762367, LR: 0.0005 Step 541/1000, Loss: 0.008816898800432682, LR: 0.0005 Step 542/1000, Loss: 0.008788255043327808, LR: 0.0005 Step 543/1000, Loss: 0.008759756572544575, LR: 0.0005 Step 544/1000, Loss: 0.008731414563953876, LR: 0.0005 Step 545/1000, Loss: 0.008703215047717094, LR: 0.0005 Step 546/1000, Loss: 0.0086751664057374, LR: 0.0005 Step 547/1000, Loss: 0.008647261187434196, LR: 0.0005 Step 548/1000, Loss: 0.008619503118097782, LR: 0.0005 Step 549/1000, Loss: 0.00859188474714756, LR: 0.0005 Step 550/1000, Loss: 0.008564407005906105, LR: 0.0005 Step 551/1000, Loss: 0.008537070825695992, LR: 0.0005 Step 552/1000, Loss: 0.00850987620651722, LR: 0.0005 Step 553/1000, Loss: 0.008482824079692364, LR: 0.0005 Step 554/1000, Loss: 0.008455904200673103, LR: 0.0005 Step 555/1000, Loss: 0.008429121226072311, LR: 0.0005 Step 556/1000, Loss: 0.008402475155889988, LR: 0.0005 Step 557/1000, Loss: 0.008375967852771282, LR: 0.0005 Step 558/1000, Loss: 0.008349591866135597, LR: 0.0005 Step 559/1000, Loss: 0.008323350921273232, LR: 0.0005 Step 560/1000, Loss: 0.008297242224216461, LR: 0.0005 Step 561/1000, Loss: 0.008271262049674988, LR: 0.0005 Step 562/1000, Loss: 0.008245415054261684, LR: 0.0005 Step 563/1000, Loss: 0.008219690062105656, LR: 0.0005 Step 564/1000, Loss: 0.008194111287593842, LR: 0.0005 Step 565/1000, Loss: 0.008168643340468407, LR: 0.0005 Step 566/1000, Loss: 0.008143315091729164, LR: 0.0005 Step 567/1000, Loss: 0.008118102326989174, LR: 0.0005 Step 568/1000, Loss: 0.00809301808476448, LR: 0.0005 Step 569/1000, Loss: 0.008068053051829338, LR: 0.0005 Step 570/1000, Loss: 0.00804322212934494, LR: 0.0005 Step 571/1000, Loss: 0.008018498308956623, LR: 0.0005 Step 572/1000, Loss: 0.007993906736373901, LR: 0.0005 Step 573/1000, Loss: 0.007969443686306477, LR: 0.0005 Step 574/1000, Loss: 0.007945085875689983, LR: 0.0005 Step 575/1000, Loss: 0.007920843549072742, LR: 0.0005 Step 576/1000, Loss: 0.007896732538938522, LR: 0.0005 Step 577/1000, Loss: 0.00787273421883583, LR: 0.0005 Step 578/1000, Loss: 0.007848854176700115, LR: 0.0005 Step 579/1000, Loss: 0.007825089618563652, LR: 0.0005 Step 580/1000, Loss: 0.007801445666700602, LR: 0.0005 Step 581/1000, Loss: 0.007777903228998184, LR: 0.0005 Step 582/1000, Loss: 0.007754480931907892, LR: 0.0005 Step 583/1000, Loss: 0.007731170859187841, LR: 0.0005 Step 584/1000, Loss: 0.007707974873483181, LR: 0.0005 Step 585/1000, Loss: 0.007684876210987568, LR: 0.0005 Step 586/1000, Loss: 0.007661907933652401, LR: 0.0005 Step 587/1000, Loss: 0.0076390416361391544, LR: 0.0005 Step 588/1000, Loss: 0.0076162852346897125, LR: 0.0005 Step 589/1000, Loss: 0.0075936331413686275, LR: 0.0005 Step 590/1000, Loss: 0.007571092341095209, LR: 0.0005 Step 591/1000, Loss: 0.007548660039901733, LR: 0.0005 Step 592/1000, Loss: 0.007526332046836615, LR: 0.0005 Step 593/1000, Loss: 0.0075041004456579685, LR: 0.0005 Step 594/1000, Loss: 0.007481986191123724, LR: 0.0005 Step 595/1000, Loss: 0.007459969259798527, LR: 0.0005 Step 596/1000, Loss: 0.0074380626901984215, LR: 0.0005 Step 597/1000, Loss: 0.007416248321533203, LR: 0.0005 Step 598/1000, Loss: 0.007394538726657629, LR: 0.0005 Step 599/1000, Loss: 0.0073729343712329865, LR: 0.0005 Step 600/1000, Loss: 0.007351431995630264, LR: 0.0005 Step 601/1000, Loss: 0.007330027408897877, LR: 0.0005 Step 602/1000, Loss: 0.0073087154887616634, LR: 0.0005 Step 603/1000, Loss: 0.007287511136382818, LR: 0.0005 Step 604/1000, Loss: 0.007266400847584009, LR: 0.0005 Step 605/1000, Loss: 0.007245390675961971, LR: 0.0005 Step 
606/1000, Loss: 0.007224473170936108, LR: 0.0005 Step 607/1000, Loss: 0.007203653454780579, LR: 0.0005 Step 608/1000, Loss: 0.0071829394437372684, LR: 0.0005 Step 609/1000, Loss: 0.0071622999384999275, LR: 0.0005 Step 610/1000, Loss: 0.007141781039535999, LR: 0.0005 Step 611/1000, Loss: 0.007121329661458731, LR: 0.0005 Step 612/1000, Loss: 0.007100989110767841, LR: 0.0005 Step 613/1000, Loss: 0.007080732379108667, LR: 0.0005 Step 614/1000, Loss: 0.00706056784838438, LR: 0.0005 Step 615/1000, Loss: 0.00704049551859498, LR: 0.0005 Step 616/1000, Loss: 0.0070205144584178925, LR: 0.0005 Step 617/1000, Loss: 0.0070006223395466805, LR: 0.0005 Step 618/1000, Loss: 0.006980816833674908, LR: 0.0005 Step 619/1000, Loss: 0.006961110047996044, LR: 0.0005 Step 620/1000, Loss: 0.006941494531929493, LR: 0.0005 Step 621/1000, Loss: 0.006921953521668911, LR: 0.0005 Step 622/1000, Loss: 0.0069025056436657906, LR: 0.0005 Step 623/1000, Loss: 0.0068831490352749825, LR: 0.0005 Step 624/1000, Loss: 0.006863863673061132, LR: 0.0005 Step 625/1000, Loss: 0.00684467563405633, LR: 0.0005 Step 626/1000, Loss: 0.006825575139373541, LR: 0.0005 Step 627/1000, Loss: 0.006806555204093456, LR: 0.0005 Step 628/1000, Loss: 0.006787620484828949, LR: 0.0005 Step 629/1000, Loss: 0.006768770515918732, LR: 0.0005 Step 630/1000, Loss: 0.006750001106411219, LR: 0.0005 Step 631/1000, Loss: 0.006731316447257996, LR: 0.0005 Step 632/1000, Loss: 0.006712707225233316, LR: 0.0005 Step 633/1000, Loss: 0.006694186478853226, LR: 0.0005 Step 634/1000, Loss: 0.006675742566585541, LR: 0.0005 Step 635/1000, Loss: 0.006657374557107687, LR: 0.0005 Step 636/1000, Loss: 0.006639101542532444, LR: 0.0005 Step 637/1000, Loss: 0.006620905362069607, LR: 0.0005 Step 638/1000, Loss: 0.006602780427783728, LR: 0.0005 Step 639/1000, Loss: 0.006584735121577978, LR: 0.0005 Step 640/1000, Loss: 0.006566768046468496, LR: 0.0005 Step 641/1000, Loss: 0.006548880599439144, LR: 0.0005 Step 642/1000, Loss: 0.006531073246151209, LR: 0.0005 Step 643/1000, Loss: 0.00651333574205637, LR: 0.0005 Step 644/1000, Loss: 0.006495679263025522, LR: 0.0005 Step 645/1000, Loss: 0.006478097289800644, LR: 0.0005 Step 646/1000, Loss: 0.006460592150688171, LR: 0.0005 Step 647/1000, Loss: 0.006443151738494635, LR: 0.0005 Step 648/1000, Loss: 0.006425797939300537, LR: 0.0005 Step 649/1000, Loss: 0.00640850979834795, LR: 0.0005 Step 650/1000, Loss: 0.00639130175113678, LR: 0.0005 Step 651/1000, Loss: 0.0063741737976670265, LR: 0.0005 Step 652/1000, Loss: 0.006357112433761358, LR: 0.0005 Step 653/1000, Loss: 0.0063401153311133385, LR: 0.0005 Step 654/1000, Loss: 0.006323198787868023, LR: 0.0005 Step 655/1000, Loss: 0.006306345574557781, LR: 0.0005 Step 656/1000, Loss: 0.00628957012668252, LR: 0.0005 Step 657/1000, Loss: 0.006272861268371344, LR: 0.0005 Step 658/1000, Loss: 0.006256225518882275, LR: 0.0005 Step 659/1000, Loss: 0.006239654030650854, LR: 0.0005 Step 660/1000, Loss: 0.006223163101822138, LR: 0.0005 Step 661/1000, Loss: 0.006206731777638197, LR: 0.0005 Step 662/1000, Loss: 0.006190373562276363, LR: 0.0005 Step 663/1000, Loss: 0.006174078676849604, LR: 0.0005 Step 664/1000, Loss: 0.006157855037599802, LR: 0.0005 Step 665/1000, Loss: 0.006141699850559235, LR: 0.0005 Step 666/1000, Loss: 0.0061256168410182, LR: 0.0005 Step 667/1000, Loss: 0.006109591107815504, LR: 0.0005 Step 668/1000, Loss: 0.0060936277732253075, LR: 0.0005 Step 669/1000, Loss: 0.006077745463699102, LR: 0.0005 Step 670/1000, Loss: 0.006061912514269352, LR: 0.0005 Step 671/1000, Loss: 0.006046155001968145, LR: 
0.0005 Step 672/1000, Loss: 0.006030459888279438, LR: 0.0005 Step 673/1000, Loss: 0.006014828570187092, LR: 0.0005 Step 674/1000, Loss: 0.005999256391078234, LR: 0.0005 Step 675/1000, Loss: 0.005983751732856035, LR: 0.0005 Step 676/1000, Loss: 0.0059683071449398994, LR: 0.0005 Step 677/1000, Loss: 0.005952931009232998, LR: 0.0005 Step 678/1000, Loss: 0.005937611218541861, LR: 0.0005 Step 679/1000, Loss: 0.005922363139688969, LR: 0.0005 Step 680/1000, Loss: 0.005907172337174416, LR: 0.0005 Step 681/1000, Loss: 0.005892039742320776, LR: 0.0005 Step 682/1000, Loss: 0.005876975599676371, LR: 0.0005 Step 683/1000, Loss: 0.005861960351467133, LR: 0.0005 Step 684/1000, Loss: 0.005847008898854256, LR: 0.0005 Step 685/1000, Loss: 0.0058321263641119, LR: 0.0005 Step 686/1000, Loss: 0.005817302968353033, LR: 0.0005 Step 687/1000, Loss: 0.005802526138722897, LR: 0.0005 Step 688/1000, Loss: 0.005787815898656845, LR: 0.0005 Step 689/1000, Loss: 0.0057731689885258675, LR: 0.0005 Step 690/1000, Loss: 0.005758573766797781, LR: 0.0005 Step 691/1000, Loss: 0.005744033958762884, LR: 0.0005 Step 692/1000, Loss: 0.005729551427066326, LR: 0.0005 Step 693/1000, Loss: 0.0057151298969984055, LR: 0.0005 Step 694/1000, Loss: 0.0057007670402526855, LR: 0.0005 Step 695/1000, Loss: 0.00568646565079689, LR: 0.0005 Step 696/1000, Loss: 0.00567221874371171, LR: 0.0005 Step 697/1000, Loss: 0.005658020731061697, LR: 0.0005 Step 698/1000, Loss: 0.005643889773637056, LR: 0.0005 Step 699/1000, Loss: 0.0056297993287444115, LR: 0.0005 Step 700/1000, Loss: 0.005615774542093277, LR: 0.0005 Step 701/1000, Loss: 0.005601801909506321, LR: 0.0005 Step 702/1000, Loss: 0.0055878860875964165, LR: 0.0005 Step 703/1000, Loss: 0.005574020557105541, LR: 0.0005 Step 704/1000, Loss: 0.005560219753533602, LR: 0.0005 Step 705/1000, Loss: 0.0055464571341872215, LR: 0.0005 Step 706/1000, Loss: 0.005532749928534031, LR: 0.0005 Step 707/1000, Loss: 0.005519102327525616, LR: 0.0005 Step 708/1000, Loss: 0.005505514796823263, LR: 0.0005 Step 709/1000, Loss: 0.005491972900927067, LR: 0.0005 Step 710/1000, Loss: 0.005478481762111187, LR: 0.0005 Step 711/1000, Loss: 0.005465040914714336, LR: 0.0005 Step 712/1000, Loss: 0.005451650358736515, LR: 0.0005 Step 713/1000, Loss: 0.00543831754475832, LR: 0.0005 Step 714/1000, Loss: 0.005425030831247568, LR: 0.0005 Step 715/1000, Loss: 0.005411794874817133, LR: 0.0005 Step 716/1000, Loss: 0.0053986175917088985, LR: 0.0005 Step 717/1000, Loss: 0.005385482218116522, LR: 0.0005 Step 718/1000, Loss: 0.005372397601604462, LR: 0.0005 Step 719/1000, Loss: 0.00535936513915658, LR: 0.0005 Step 720/1000, Loss: 0.005346388556063175, LR: 0.0005 Step 721/1000, Loss: 0.005333453416824341, LR: 0.0005 Step 722/1000, Loss: 0.0053205667063593864, LR: 0.0005 Step 723/1000, Loss: 0.0053077321499586105, LR: 0.0005 Step 724/1000, Loss: 0.0052949427627027035, LR: 0.0005 Step 725/1000, Loss: 0.005282202735543251, LR: 0.0005 Step 726/1000, Loss: 0.005269509740173817, LR: 0.0005 Step 727/1000, Loss: 0.005256865173578262, LR: 0.0005 Step 728/1000, Loss: 0.005244269035756588, LR: 0.0005 Step 729/1000, Loss: 0.005231723189353943, LR: 0.0005 Step 730/1000, Loss: 0.005219218786805868, LR: 0.0005 Step 731/1000, Loss: 0.005206763744354248, LR: 0.0005 Step 732/1000, Loss: 0.005194360390305519, LR: 0.0005 Step 733/1000, Loss: 0.005181995220482349, LR: 0.0005 Step 734/1000, Loss: 0.00516967847943306, LR: 0.0005 Step 735/1000, Loss: 0.005157409701496363, LR: 0.0005 Step 736/1000, Loss: 0.0051451874896883965, LR: 0.0005 Step 737/1000, Loss: 
0.005133005324751139, LR: 0.0005 Step 738/1000, Loss: 0.00512087345123291, LR: 0.0005 Step 739/1000, Loss: 0.005108783021569252, LR: 0.0005 Step 740/1000, Loss: 0.005096740555018187, LR: 0.0005 Step 741/1000, Loss: 0.005084737669676542, LR: 0.0005 Step 742/1000, Loss: 0.005072786472737789, LR: 0.0005 Step 743/1000, Loss: 0.005060871597379446, LR: 0.0005 Step 744/1000, Loss: 0.0050490060821175575, LR: 0.0005 Step 745/1000, Loss: 0.005037182010710239, LR: 0.0005 Step 746/1000, Loss: 0.005025395657867193, LR: 0.0005 Step 747/1000, Loss: 0.005013664253056049, LR: 0.0005 Step 748/1000, Loss: 0.005001957528293133, LR: 0.0005 Step 749/1000, Loss: 0.004990304354578257, LR: 0.0005 Step 750/1000, Loss: 0.004978703800588846, LR: 0.0005 Step 751/1000, Loss: 0.00496713537722826, LR: 0.0005 Step 752/1000, Loss: 0.004955610726028681, LR: 0.0005 Step 753/1000, Loss: 0.004944118205457926, LR: 0.0005 Step 754/1000, Loss: 0.004932672716677189, LR: 0.0005 Step 755/1000, Loss: 0.0049212719313800335, LR: 0.0005 Step 756/1000, Loss: 0.004909916780889034, LR: 0.0005 Step 757/1000, Loss: 0.004898594226688147, LR: 0.0005 Step 758/1000, Loss: 0.004887314047664404, LR: 0.0005 Step 759/1000, Loss: 0.004876072518527508, LR: 0.0005 Step 760/1000, Loss: 0.00486487802118063, LR: 0.0005 Step 761/1000, Loss: 0.004853720776736736, LR: 0.0005 Step 762/1000, Loss: 0.004842604510486126, LR: 0.0005 Step 763/1000, Loss: 0.004831527825444937, LR: 0.0005 Step 764/1000, Loss: 0.004820483736693859, LR: 0.0005 Step 765/1000, Loss: 0.0048094866797327995, LR: 0.0005 Step 766/1000, Loss: 0.004798525478690863, LR: 0.0005 Step 767/1000, Loss: 0.004787603858858347, LR: 0.0005 Step 768/1000, Loss: 0.004776715766638517, LR: 0.0005 Step 769/1000, Loss: 0.004765878431499004, LR: 0.0005 Step 770/1000, Loss: 0.004755067173391581, LR: 0.0005 Step 771/1000, Loss: 0.004744302947074175, LR: 0.0005 Step 772/1000, Loss: 0.004733569920063019, LR: 0.0005 Step 773/1000, Loss: 0.004722872748970985, LR: 0.0005 Step 774/1000, Loss: 0.004712224937975407, LR: 0.0005 Step 775/1000, Loss: 0.004701610654592514, LR: 0.0005 Step 776/1000, Loss: 0.00469102943316102, LR: 0.0005 Step 777/1000, Loss: 0.004680488258600235, LR: 0.0005 Step 778/1000, Loss: 0.004669980611652136, LR: 0.0005 Step 779/1000, Loss: 0.00465951394289732, LR: 0.0005 Step 780/1000, Loss: 0.004649073351174593, LR: 0.0005 Step 781/1000, Loss: 0.00463868398219347, LR: 0.0005 Step 782/1000, Loss: 0.004628323018550873, LR: 0.0005 Step 783/1000, Loss: 0.004617996513843536, LR: 0.0005 Step 784/1000, Loss: 0.004607713781297207, LR: 0.0005 Step 785/1000, Loss: 0.004597459454089403, LR: 0.0005 Step 786/1000, Loss: 0.004587238188832998, LR: 0.0005 Step 787/1000, Loss: 0.004577059298753738, LR: 0.0005 Step 788/1000, Loss: 0.004566916264593601, LR: 0.0005 Step 789/1000, Loss: 0.0045567965134978294, LR: 0.0005 Step 790/1000, Loss: 0.004546718206256628, LR: 0.0005 Step 791/1000, Loss: 0.004536682274192572, LR: 0.0005 Step 792/1000, Loss: 0.0045266770757734776, LR: 0.0005 Step 793/1000, Loss: 0.004516695160418749, LR: 0.0005 Step 794/1000, Loss: 0.004506758414208889, LR: 0.0005 Step 795/1000, Loss: 0.004496856592595577, LR: 0.0005 Step 796/1000, Loss: 0.004486984573304653, LR: 0.0005 Step 797/1000, Loss: 0.004477140959352255, LR: 0.0005 Step 798/1000, Loss: 0.004467337392270565, LR: 0.0005 Step 799/1000, Loss: 0.004457566887140274, LR: 0.0005 Step 800/1000, Loss: 0.004447835963219404, LR: 0.0005 Step 801/1000, Loss: 0.004438124597072601, LR: 0.0005 Step 802/1000, Loss: 0.004428455606102943, LR: 0.0005 Step 
803/1000, Loss: 0.0044188122265040874, LR: 0.0005 Step 804/1000, Loss: 0.004409208428114653, LR: 0.0005 Step 805/1000, Loss: 0.004399636760354042, LR: 0.0005 Step 806/1000, Loss: 0.004390091635286808, LR: 0.0005 Step 807/1000, Loss: 0.00438058003783226, LR: 0.0005 Step 808/1000, Loss: 0.0043711126782000065, LR: 0.0005 Step 809/1000, Loss: 0.004361661616712809, LR: 0.0005 Step 810/1000, Loss: 0.004352245479822159, LR: 0.0005 Step 811/1000, Loss: 0.004342859145253897, LR: 0.0005 Step 812/1000, Loss: 0.004333512391895056, LR: 0.0005 Step 813/1000, Loss: 0.0043241968378424644, LR: 0.0005 Step 814/1000, Loss: 0.004314909223467112, LR: 0.0005 Step 815/1000, Loss: 0.004305648151785135, LR: 0.0005 Step 816/1000, Loss: 0.004296420607715845, LR: 0.0005 Step 817/1000, Loss: 0.004287217743694782, LR: 0.0005 Step 818/1000, Loss: 0.004278055392205715, LR: 0.0005 Step 819/1000, Loss: 0.0042689209803938866, LR: 0.0005 Step 820/1000, Loss: 0.004259810782968998, LR: 0.0005 Step 821/1000, Loss: 0.004250743426382542, LR: 0.0005 Step 822/1000, Loss: 0.004241696558892727, LR: 0.0005 Step 823/1000, Loss: 0.00423267250880599, LR: 0.0005 Step 824/1000, Loss: 0.004223690368235111, LR: 0.0005 Step 825/1000, Loss: 0.00421473104506731, LR: 0.0005 Step 826/1000, Loss: 0.0042058005928993225, LR: 0.0005 Step 827/1000, Loss: 0.0041969045996665955, LR: 0.0005 Step 828/1000, Loss: 0.004188031889498234, LR: 0.0005 Step 829/1000, Loss: 0.004179197363555431, LR: 0.0005 Step 830/1000, Loss: 0.004170377738773823, LR: 0.0005 Step 831/1000, Loss: 0.004161602817475796, LR: 0.0005 Step 832/1000, Loss: 0.00415284838527441, LR: 0.0005 Step 833/1000, Loss: 0.004144120030105114, LR: 0.0005 Step 834/1000, Loss: 0.004135423339903355, LR: 0.0005 Step 835/1000, Loss: 0.0041267527267336845, LR: 0.0005 Step 836/1000, Loss: 0.004118111915886402, LR: 0.0005 Step 837/1000, Loss: 0.004109499976038933, LR: 0.0005 Step 838/1000, Loss: 0.004100920632481575, LR: 0.0005 Step 839/1000, Loss: 0.004092366900295019, LR: 0.0005 Step 840/1000, Loss: 0.004083829931914806, LR: 0.0005 Step 841/1000, Loss: 0.004075332544744015, LR: 0.0005 Step 842/1000, Loss: 0.0040668584406375885, LR: 0.0005 Step 843/1000, Loss: 0.004058408550918102, LR: 0.0005 Step 844/1000, Loss: 0.004049979615956545, LR: 0.0005 Step 845/1000, Loss: 0.004041589796543121, LR: 0.0005 Step 846/1000, Loss: 0.0040332237258553505, LR: 0.0005 Step 847/1000, Loss: 0.004024882800877094, LR: 0.0005 Step 848/1000, Loss: 0.004016571678221226, LR: 0.0005 Step 849/1000, Loss: 0.0040082866325974464, LR: 0.0005 Step 850/1000, Loss: 0.004000022076070309, LR: 0.0005 Step 851/1000, Loss: 0.003991791047155857, LR: 0.0005 Step 852/1000, Loss: 0.003983574919402599, LR: 0.0005 Step 853/1000, Loss: 0.003975397441536188, LR: 0.0005 Step 854/1000, Loss: 0.003967237658798695, LR: 0.0005 Step 855/1000, Loss: 0.003959109075367451, LR: 0.0005 Step 856/1000, Loss: 0.003951003309339285, LR: 0.0005 Step 857/1000, Loss: 0.00394292501732707, LR: 0.0005 Step 858/1000, Loss: 0.003934868611395359, LR: 0.0005 Step 859/1000, Loss: 0.003926844336092472, LR: 0.0005 Step 860/1000, Loss: 0.003918840084224939, LR: 0.0005 Step 861/1000, Loss: 0.003910858649760485, LR: 0.0005 Step 862/1000, Loss: 0.003902912838384509, LR: 0.0005 Step 863/1000, Loss: 0.0038949833251535892, LR: 0.0005 Step 864/1000, Loss: 0.0038870838470757008, LR: 0.0005 Step 865/1000, Loss: 0.0038792050909250975, LR: 0.0005 Step 866/1000, Loss: 0.0038713521789759398, LR: 0.0005 Step 867/1000, Loss: 0.0038635204546153545, LR: 0.0005 Step 868/1000, Loss: 
0.003855716437101364, LR: 0.0005 Step 869/1000, Loss: 0.003847935702651739, LR: 0.0005 Step 870/1000, Loss: 0.0038401789497584105, LR: 0.0005 Step 871/1000, Loss: 0.003832441521808505, LR: 0.0005 Step 872/1000, Loss: 0.0038247413467615843, LR: 0.0005 Step 873/1000, Loss: 0.0038170539774000645, LR: 0.0005 Step 874/1000, Loss: 0.0038093936163932085, LR: 0.0005 Step 875/1000, Loss: 0.0038017607294023037, LR: 0.0005 Step 876/1000, Loss: 0.003794149961322546, LR: 0.0005 Step 877/1000, Loss: 0.003786554094403982, LR: 0.0005 Step 878/1000, Loss: 0.0037789822090417147, LR: 0.0005 Step 879/1000, Loss: 0.00377144617959857, LR: 0.0005 Step 880/1000, Loss: 0.003763922257348895, LR: 0.0005 Step 881/1000, Loss: 0.0037564232479780912, LR: 0.0005 Step 882/1000, Loss: 0.003748946590349078, LR: 0.0005 Step 883/1000, Loss: 0.0037414957769215107, LR: 0.0005 Step 884/1000, Loss: 0.003734070807695389, LR: 0.0005 Step 885/1000, Loss: 0.0037266656290739775, LR: 0.0005 Step 886/1000, Loss: 0.003719282103702426, LR: 0.0005 Step 887/1000, Loss: 0.0037119188345968723, LR: 0.0005 Step 888/1000, Loss: 0.003704572794958949, LR: 0.0005 Step 889/1000, Loss: 0.0036972600501030684, LR: 0.0005 Step 890/1000, Loss: 0.0036899708211421967, LR: 0.0005 Step 891/1000, Loss: 0.0036826960276812315, LR: 0.0005 Step 892/1000, Loss: 0.003675443585962057, LR: 0.0005 Step 893/1000, Loss: 0.0036682181525975466, LR: 0.0005 Step 894/1000, Loss: 0.0036610171664506197, LR: 0.0005 Step 895/1000, Loss: 0.0036538317799568176, LR: 0.0005 Step 896/1000, Loss: 0.003646660130470991, LR: 0.0005 Step 897/1000, Loss: 0.003639526665210724, LR: 0.0005 Step 898/1000, Loss: 0.0036323971580713987, LR: 0.0005 Step 899/1000, Loss: 0.003625305835157633, LR: 0.0005 Step 900/1000, Loss: 0.003618230577558279, LR: 0.0005 Step 901/1000, Loss: 0.0036111711524426937, LR: 0.0005 Step 902/1000, Loss: 0.0036041378043591976, LR: 0.0005 Step 903/1000, Loss: 0.003597124246880412, LR: 0.0005 Step 904/1000, Loss: 0.003590137232095003, LR: 0.0005 Step 905/1000, Loss: 0.003583161626011133, LR: 0.0005 Step 906/1000, Loss: 0.0035762074403464794, LR: 0.0005 Step 907/1000, Loss: 0.0035692814271897078, LR: 0.0005 Step 908/1000, Loss: 0.00356237031519413, LR: 0.0005 Step 909/1000, Loss: 0.0035554796922951937, LR: 0.0005 Step 910/1000, Loss: 0.003548609558492899, LR: 0.0005 Step 911/1000, Loss: 0.003541758982464671, LR: 0.0005 Step 912/1000, Loss: 0.0035349340178072453, LR: 0.0005 Step 913/1000, Loss: 0.0035281225573271513, LR: 0.0005 Step 914/1000, Loss: 0.0035213276278227568, LR: 0.0005 Step 915/1000, Loss: 0.0035145606379956007, LR: 0.0005 Step 916/1000, Loss: 0.0035078185610473156, LR: 0.0005 Step 917/1000, Loss: 0.003501088824123144, LR: 0.0005 Step 918/1000, Loss: 0.0034943842329084873, LR: 0.0005 Step 919/1000, Loss: 0.0034876905847340822, LR: 0.0005 Step 920/1000, Loss: 0.003481017891317606, LR: 0.0005 Step 921/1000, Loss: 0.003474373370409012, LR: 0.0005 Step 922/1000, Loss: 0.0034677416551858187, LR: 0.0005 Step 923/1000, Loss: 0.00346113252453506, LR: 0.0005 Step 924/1000, Loss: 0.0034545338712632656, LR: 0.0005 Step 925/1000, Loss: 0.0034479654859751463, LR: 0.0005 Step 926/1000, Loss: 0.003441412001848221, LR: 0.0005 Step 927/1000, Loss: 0.003434880869463086, LR: 0.0005 Step 928/1000, Loss: 0.0034283646382391453, LR: 0.0005 Step 929/1000, Loss: 0.00342186470516026, LR: 0.0005 Step 930/1000, Loss: 0.003415388520807028, LR: 0.0005 Step 931/1000, Loss: 0.003408926073461771, LR: 0.0005 Step 932/1000, Loss: 0.0034024883061647415, LR: 0.0005 Step 933/1000, Loss: 
0.0033960696309804916, LR: 0.0005 Step 934/1000, Loss: 0.0033896665554493666, LR: 0.0005 Step 935/1000, Loss: 0.0033832855988293886, LR: 0.0005 Step 936/1000, Loss: 0.003376914653927088, LR: 0.0005 Step 937/1000, Loss: 0.0033705648966133595, LR: 0.0005 Step 938/1000, Loss: 0.003364233300089836, LR: 0.0005 Step 939/1000, Loss: 0.003357923123985529, LR: 0.0005 Step 940/1000, Loss: 0.003351630177348852, LR: 0.0005 Step 941/1000, Loss: 0.003345354925841093, LR: 0.0005 Step 942/1000, Loss: 0.003339096438139677, LR: 0.0005 Step 943/1000, Loss: 0.0033328577410429716, LR: 0.0005 Step 944/1000, Loss: 0.0033266418613493443, LR: 0.0005 Step 945/1000, Loss: 0.0033204425126314163, LR: 0.0005 Step 946/1000, Loss: 0.0033142506144940853, LR: 0.0005 Step 947/1000, Loss: 0.0033080782741308212, LR: 0.0005 Step 948/1000, Loss: 0.003301924094557762, LR: 0.0005 Step 949/1000, Loss: 0.0032957918010652065, LR: 0.0005 Step 950/1000, Loss: 0.003289679531008005, LR: 0.0005 Step 951/1000, Loss: 0.003283575875684619, LR: 0.0005 Step 952/1000, Loss: 0.0032774941064417362, LR: 0.0005 Step 953/1000, Loss: 0.0032714330591261387, LR: 0.0005 Step 954/1000, Loss: 0.0032653820235282183, LR: 0.0005 Step 955/1000, Loss: 0.0032593528740108013, LR: 0.0005 Step 956/1000, Loss: 0.0032533383928239346, LR: 0.0005 Step 957/1000, Loss: 0.003247339278459549, LR: 0.0005 Step 958/1000, Loss: 0.003241369966417551, LR: 0.0005 Step 959/1000, Loss: 0.003235395299270749, LR: 0.0005 Step 960/1000, Loss: 0.0032294518314301968, LR: 0.0005 Step 961/1000, Loss: 0.0032235246617347, LR: 0.0005 Step 962/1000, Loss: 0.003217612160369754, LR: 0.0005 Step 963/1000, Loss: 0.003211710602045059, LR: 0.0005 Step 964/1000, Loss: 0.003205829067155719, LR: 0.0005 Step 965/1000, Loss: 0.003199969185516238, LR: 0.0005 Step 966/1000, Loss: 0.003194121178239584, LR: 0.0005 Step 967/1000, Loss: 0.00318829040043056, LR: 0.0005 Step 968/1000, Loss: 0.0031824796460568905, LR: 0.0005 Step 969/1000, Loss: 0.0031766779720783234, LR: 0.0005 Step 970/1000, Loss: 0.003170899348333478, LR: 0.0005 Step 971/1000, Loss: 0.0031651295721530914, LR: 0.0005 Step 972/1000, Loss: 0.0031593807507306337, LR: 0.0005 Step 973/1000, Loss: 0.0031536526512354612, LR: 0.0005 Step 974/1000, Loss: 0.0031479306053370237, LR: 0.0005 Step 975/1000, Loss: 0.0031422239262610674, LR: 0.0005 Step 976/1000, Loss: 0.0031365402974188328, LR: 0.0005 Step 977/1000, Loss: 0.0031308692414313555, LR: 0.0005 Step 978/1000, Loss: 0.0031252100598067045, LR: 0.0005 Step 979/1000, Loss: 0.003119572065770626, LR: 0.0005 Step 980/1000, Loss: 0.003113951999694109, LR: 0.0005 Step 981/1000, Loss: 0.0031083361245691776, LR: 0.0005 Step 982/1000, Loss: 0.00310274469666183, LR: 0.0005 Step 983/1000, Loss: 0.0030971714295446873, LR: 0.0005 Step 984/1000, Loss: 0.003091600490733981, LR: 0.0005 Step 985/1000, Loss: 0.0030860528349876404, LR: 0.0005 Step 986/1000, Loss: 0.0030805172864347696, LR: 0.0005 Step 987/1000, Loss: 0.0030750036239624023, LR: 0.0005 Step 988/1000, Loss: 0.0030694985762238503, LR: 0.0005 Step 989/1000, Loss: 0.0030640107579529285, LR: 0.0005 Step 990/1000, Loss: 0.0030585364438593388, LR: 0.0005 Step 991/1000, Loss: 0.0030530826188623905, LR: 0.0005 Step 992/1000, Loss: 0.0030476334504783154, LR: 0.0005 Step 993/1000, Loss: 0.003042203839868307, LR: 0.0005 Step 994/1000, Loss: 0.0030367891304194927, LR: 0.0005 Step 995/1000, Loss: 0.0030313883908092976, LR: 0.0005 Step 996/1000, Loss: 0.003025999991223216, LR: 0.0005 Step 997/1000, Loss: 0.003020629985257983, LR: 0.0005 Step 998/1000, Loss: 
0.003015281166881323, LR: 0.0005 Step 999/1000, Loss: 0.003009930718690157, LR: 0.0005 Step 1000/1000, Loss: 0.003004607977345586, LR: 0.0005
plt.plot(losses)
(Figure: the training loss, decreasing smoothly toward zero as the model overfits the small dataset.)
To perform inference, we can run the transformer autoregressively, appending each predicted token to the input before predicting the next one. We can test this on one of our training examples: since the model has been deliberately overfit to the data, it should reproduce the training sequence exactly, token for token.
def inference(prompt, max_new_tokens):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        num_tokens = len(tokens)
        # Pad the sequence to the model's fixed context length with end-of-text tokens
        tokens_padded = tokens + [tokenizer.eot_token] * (config.seq_len - num_tokens)
        tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)  # shape (1, seq_len)
        logits = model(tokens_padded)
        # Greedily take the most likely token at the last real position
        predicted_token = torch.argmax(logits[0, num_tokens - 1, :]).item()
        tokens.append(predicted_token)
    return tokenizer.decode(tokens)
print("Original: ", tokenizer.decode(train_inputs[2].tolist())[:90])
print("Predicted:", inference(" director Takeshi Ozawa . A large team of writers handled the script", max_new_tokens=6))Original: director Takeshi Ozawa . A large team of writers handled the script . The game 's opening Predicted: director Takeshi Ozawa . A large team of writers handled the script . The game 's opening
Using tiktoken and a small dataset, we were able to overfit the model and run inference examples. However, in order to train an LLM that can do useful things, we will need a larger dataset that won't fit in memory. We will also need an efficient way to tokenize the dataset and load it into PyTorch tensors. Hugging Face's datasets library makes this process very easy.
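For reference, the library can also yield examples lazily; here is a minimal sketch of streaming mode (the cell below instead downloads the dataset up front):

from itertools import islice
from datasets import load_dataset

# Sketch: with streaming=True, examples are yielded lazily rather than
# materializing the full corpus first
streamed = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train", streaming=True)
for example in islice(streamed, 2):
    print(example["article"][:80])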
# Load the CNN/DailyMail dataset and the GPT-2 tokenizer
ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def check_dataset_exists():
    try:
        # Attempt to load the tokenized parquet files from disk
        load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
        load_dataset("parquet", data_files="cnn_dailymail_test.parquet", split="train")
        return True
    except FileNotFoundError:
        return False

if not check_dataset_exists():
    print("Tokenized dataset does not exist locally... Generating and saving to disk.")

    def tokenize_and_chunk(dataset, tokenizer, chunk_size=512, train_rows=100_000, test_rows=500):
        """
        Tokenizes and chunks the dataset into fixed-length `chunk_size`-token segments.
        The 'target' sequence is the input shifted left by 1 token.
        Stops after generating `train_rows + test_rows` tokenized chunks.
        """
        buffer = []  # Rolling buffer of tokens, spanning article boundaries
        row_count = 0
        for example in dataset:
            tokens = tokenizer(example["article"], truncation=False, padding=False)["input_ids"]
            buffer.extend(tokens)
            # Yield full chunks until we reach train_rows + test_rows
            while len(buffer) >= chunk_size + 1:  # +1 so the target can be shifted
                if row_count >= (train_rows + test_rows):
                    return  # Stop yielding once enough rows are reached
                # Create input-target pairs
                input_chunk = buffer[:chunk_size]        # First chunk_size tokens
                target_chunk = buffer[1:chunk_size + 1]  # Shifted by 1 token
                # Tag the row as train or test (the actual split below uses train_test_split)
                split = "train" if row_count < train_rows else "test"
                yield {
                    "split": split,
                    "input": input_chunk,
                    "target": target_chunk,
                }
                buffer = buffer[chunk_size:]  # Remove used tokens
                row_count += 1

    # Set the max number of rows for training and testing
    TRAIN_ROWS = 1400000  # Adjust as needed
    TEST_ROWS = 500       # Adjust as needed
    CHUNK_SIZE = 128

    # Convert the generator to a Hugging Face Dataset
    tokenized_ds = Dataset.from_generator(
        lambda: tokenize_and_chunk(ds, hf_tokenizer, chunk_size=CHUNK_SIZE, train_rows=TRAIN_ROWS, test_rows=TEST_ROWS)
    )

    # Split the dataset into `train` and `test`
    dataset_splits = tokenized_ds.train_test_split(test_size=TEST_ROWS / (TRAIN_ROWS + TEST_ROWS), seed=42)

    # Save to disk
    dataset_splits["train"].to_parquet("cnn_dailymail_train.parquet")
    dataset_splits["test"].to_parquet("cnn_dailymail_test.parquet")
    print(f"✅ Saved {TRAIN_ROWS} train rows and {TEST_ROWS} test rows.")
else:
    print("Tokenized dataset already exists locally.")
Tokenized dataset does not exist locally... Generating and saving to disk.
Token indices sequence length is longer than the specified maximum sequence length for this model (1194 > 1024). Running this sequence through the model will result in indexing errors
✅ Saved 1400000 train rows and 500 test rows.
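As a quick sanity check (a sketch, not part of the original notebook), we can reload one saved row and confirm that the target really is the input shifted left by one token:

# Sanity-check sketch: reload one row from the saved parquet file
check_ds = load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
row = check_ds[0]
assert row["input"][1:] == row["target"][:-1]  # target = input shifted by one token
print(hf_tokenizer.decode(row["input"][:12]))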
We have tokenized the dataset in chunks and saved it to disk as parquet files. This is a scalable approach that lets us train the model without ever holding the entire dataset in memory. Let's write a more robust training loop that saves model checkpoints at regular intervals.
# Example config:
batch_size = 64
sequence_len = 128
num_steps = 150000
accumulation_steps = 100  # steps between logging, evaluation, and LR scheduling (no gradient accumulation happens here)

# Reload the train and test datasets
train_ds = load_dataset("parquet", data_files="cnn_dailymail_train.parquet", split="train")
test_ds = load_dataset("parquet", data_files="cnn_dailymail_test.parquet", split="train")

# Convert the datasets to PyTorch format
train_ds.set_format("torch", columns=["input", "target"])
test_ds.set_format("torch", columns=["input", "target"])

# Create DataLoaders for training and testing (cycle makes the train loader repeat indefinitely)
train_dataloader = cycle(DataLoader(train_ds, batch_size=batch_size, shuffle=False))
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

config = GPTConfig(
    vocab_size=hf_tokenizer.vocab_size,
    n_layer=8,  # fewer layers for a quick demo
    n_head=8,
    n_embd=128,
    seq_len=sequence_len,
)

# Create the GPT model
model = GPTModel(config).to(device)

# Check if a pre-trained model exists
use_existing_model = os.path.exists("./pretrain_final.pth")
if use_existing_model:
    model = torch.load("./pretrain_final.pth", weights_only=False, map_location=device)
    print("Loaded pre-trained model from ./pretrain_final.pth, skipping training loop.")
else:
    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    # Define the scheduler: reduce the LR when the test loss plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.3, patience=10, min_lr=5e-6, threshold=1e-4
    )

    # Training loop
    losses = []
    test_losses = []
    accumulator = 0
    accumulator_loss = 0
    start_time = time.time()
    for i in range(num_steps):
        model.train()
        example = next(train_dataloader)
        train_input = example["input"].to(device)
        train_target = example["target"].to(device)

        logits = model(train_input)
        # Flatten (batch, seq, vocab) logits and (batch, seq) targets for cross entropy
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), train_target.view(-1))
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update weights
        optimizer.step()
        optimizer.zero_grad()

        accumulator += 1
        accumulator_loss += loss.item()
        if accumulator >= accumulation_steps:
            losses.append(accumulator_loss / accumulation_steps)
            accumulator = 0
            accumulator_loss = 0

            # Evaluate on the held-out test split
            model.eval()
            test_loss = 0
            test_accumulator = 0
            with torch.no_grad():
                for test_example in test_dataloader:
                    test_input = test_example["input"].to(device)
                    test_target = test_example["target"].to(device)
                    test_logits = model(test_input)
                    test_loss += F.cross_entropy(test_logits.view(-1, test_logits.size(-1)), test_target.view(-1)).item()
                    test_accumulator += 1
            test_losses.append(test_loss / test_accumulator)

            elapsed_time = time.time() - start_time
            print(f"Step {i+1}/{num_steps}, Loss: {losses[-1]}, Test Loss: {test_losses[-1]}, LR: {optimizer.param_groups[0]['lr']}, Elapsed Time: {elapsed_time:.2f} seconds")
            scheduler.step(test_losses[-1])

        if (i + 1) % 50000 == 0:
            # Save a model checkpoint
            print(f"Saving model checkpoint at step {i+1}")
            torch.save(model, f"./model_checkpoint_{i+1}.pt")
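Since the loop saves whole-model checkpoints with torch.save(model, ...), an interrupted run can be resumed from disk. A minimal sketch, assuming a checkpoint file such as ./model_checkpoint_50000.pt exists:

# Sketch (hypothetical filename): reload a whole-model checkpoint
# saved by the loop above onto the current device
model = torch.load("./model_checkpoint_50000.pt", weights_only=False, map_location=device)
model.train()  # or model.eval() for inference

# A fresh optimizer over the reloaded parameters is needed to continue training
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)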
Step 100/150000, Loss: 8.29131314754486, Test Loss: 7.32573276758194, LR: 0.0005, Elapsed Time: 2.37 seconds
Step 200/150000, Loss: 7.117442603111267, Test Loss: 6.908363699913025, LR: 0.0005, Elapsed Time: 4.71 seconds
Step 300/150000, Loss: 6.772330284118652, Test Loss: 6.633603274822235, LR: 0.0005, Elapsed Time: 7.06 seconds
...
Step 15400/150000, Loss: 4.5553831148147585, Test Loss: 4.561410963535309, LR: 0.0005, Elapsed Time: 365.45 seconds
Step 15500/150000, Loss: 4.5776596355438235, Test Loss: 4.559787631034851, LR: 0.0005, Elapsed Time: 367.85 seconds
Step 15600/150000, Loss: 4.548686604499817, Test Loss: 4.555811166763306, LR: 0.0005, ...
Elapsed Time: 370.26 seconds Step 15700/150000, Loss: 4.576506309509277, Test Loss: 4.5573484897613525, LR: 0.0005, Elapsed Time: 372.66 seconds Step 15800/150000, Loss: 4.565245881080627, Test Loss: 4.554625511169434, LR: 0.0005, Elapsed Time: 375.06 seconds Step 15900/150000, Loss: 4.546299142837524, Test Loss: 4.552566707134247, LR: 0.0005, Elapsed Time: 377.46 seconds Step 16000/150000, Loss: 4.540788903236389, Test Loss: 4.557130753993988, LR: 0.0005, Elapsed Time: 379.87 seconds Step 16100/150000, Loss: 4.5580313730239865, Test Loss: 4.548960506916046, LR: 0.0005, Elapsed Time: 382.27 seconds Step 16200/150000, Loss: 4.553918199539185, Test Loss: 4.548383891582489, LR: 0.0005, Elapsed Time: 384.67 seconds Step 16300/150000, Loss: 4.553248920440674, Test Loss: 4.543776452541351, LR: 0.0005, Elapsed Time: 387.06 seconds Step 16400/150000, Loss: 4.550207643508911, Test Loss: 4.54317843914032, LR: 0.0005, Elapsed Time: 389.44 seconds Step 16500/150000, Loss: 4.546432008743286, Test Loss: 4.539352059364319, LR: 0.0005, Elapsed Time: 391.84 seconds Step 16600/150000, Loss: 4.546410517692566, Test Loss: 4.541164398193359, LR: 0.0005, Elapsed Time: 394.23 seconds Step 16700/150000, Loss: 4.54778567314148, Test Loss: 4.534104883670807, LR: 0.0005, Elapsed Time: 396.62 seconds Step 16800/150000, Loss: 4.544259872436523, Test Loss: 4.535196006298065, LR: 0.0005, Elapsed Time: 399.03 seconds Step 16900/150000, Loss: 4.540018525123596, Test Loss: 4.535028040409088, LR: 0.0005, Elapsed Time: 401.43 seconds Step 17000/150000, Loss: 4.53421151638031, Test Loss: 4.530624985694885, LR: 0.0005, Elapsed Time: 403.84 seconds Step 17100/150000, Loss: 4.528057346343994, Test Loss: 4.531696438789368, LR: 0.0005, Elapsed Time: 406.24 seconds Step 17200/150000, Loss: 4.545023851394653, Test Loss: 4.53470242023468, LR: 0.0005, Elapsed Time: 408.64 seconds Step 17300/150000, Loss: 4.53915090084076, Test Loss: 4.526281297206879, LR: 0.0005, Elapsed Time: 411.06 seconds Step 17400/150000, Loss: 4.538897013664245, Test Loss: 4.524180054664612, LR: 0.0005, Elapsed Time: 413.46 seconds Step 17500/150000, Loss: 4.50747362613678, Test Loss: 4.526680409908295, LR: 0.0005, Elapsed Time: 415.88 seconds Step 17600/150000, Loss: 4.5248877763748165, Test Loss: 4.519170045852661, LR: 0.0005, Elapsed Time: 418.28 seconds Step 17700/150000, Loss: 4.5213628768920895, Test Loss: 4.52188104391098, LR: 0.0005, Elapsed Time: 420.69 seconds Step 17800/150000, Loss: 4.534228496551513, Test Loss: 4.523009479045868, LR: 0.0005, Elapsed Time: 423.11 seconds Step 17900/150000, Loss: 4.525176420211792, Test Loss: 4.51629912853241, LR: 0.0005, Elapsed Time: 425.51 seconds Step 18000/150000, Loss: 4.5262261533737185, Test Loss: 4.518033802509308, LR: 0.0005, Elapsed Time: 427.92 seconds Step 18100/150000, Loss: 4.519547142982483, Test Loss: 4.521687209606171, LR: 0.0005, Elapsed Time: 430.32 seconds Step 18200/150000, Loss: 4.52007483959198, Test Loss: 4.516174495220184, LR: 0.0005, Elapsed Time: 432.74 seconds Step 18300/150000, Loss: 4.509054546356201, Test Loss: 4.51198947429657, LR: 0.0005, Elapsed Time: 435.14 seconds Step 18400/150000, Loss: 4.511096715927124, Test Loss: 4.510901987552643, LR: 0.0005, Elapsed Time: 437.55 seconds Step 18500/150000, Loss: 4.520346689224243, Test Loss: 4.510843813419342, LR: 0.0005, Elapsed Time: 439.97 seconds Step 18600/150000, Loss: 4.519790368080139, Test Loss: 4.507764577865601, LR: 0.0005, Elapsed Time: 442.37 seconds Step 18700/150000, Loss: 4.522249169349671, Test Loss: 4.503726065158844, LR: 
0.0005, Elapsed Time: 444.77 seconds Step 18800/150000, Loss: 4.512528171539307, Test Loss: 4.505193412303925, LR: 0.0005, Elapsed Time: 447.18 seconds Step 18900/150000, Loss: 4.512978367805481, Test Loss: 4.502611815929413, LR: 0.0005, Elapsed Time: 449.58 seconds Step 19000/150000, Loss: 4.500606164932251, Test Loss: 4.504136919975281, LR: 0.0005, Elapsed Time: 451.99 seconds Step 19100/150000, Loss: 4.510384464263916, Test Loss: 4.502659559249878, LR: 0.0005, Elapsed Time: 454.39 seconds Step 19200/150000, Loss: 4.500471243858337, Test Loss: 4.497892618179321, LR: 0.0005, Elapsed Time: 456.78 seconds Step 19300/150000, Loss: 4.493755359649658, Test Loss: 4.497722685337067, LR: 0.0005, Elapsed Time: 459.18 seconds Step 19400/150000, Loss: 4.504688720703125, Test Loss: 4.4977006316185, LR: 0.0005, Elapsed Time: 461.60 seconds Step 19500/150000, Loss: 4.50128930568695, Test Loss: 4.498151659965515, LR: 0.0005, Elapsed Time: 464.00 seconds Step 19600/150000, Loss: 4.498463568687439, Test Loss: 4.493999004364014, LR: 0.0005, Elapsed Time: 466.42 seconds Step 19700/150000, Loss: 4.492149424552918, Test Loss: 4.492654085159302, LR: 0.0005, Elapsed Time: 468.83 seconds Step 19800/150000, Loss: 4.49811240196228, Test Loss: 4.490227580070496, LR: 0.0005, Elapsed Time: 471.26 seconds Step 19900/150000, Loss: 4.504788327217102, Test Loss: 4.491574704647064, LR: 0.0005, Elapsed Time: 473.67 seconds Step 20000/150000, Loss: 4.504937472343445, Test Loss: 4.487073063850403, LR: 0.0005, Elapsed Time: 476.09 seconds Step 20100/150000, Loss: 4.488566756248474, Test Loss: 4.4853785037994385, LR: 0.0005, Elapsed Time: 478.50 seconds Step 20200/150000, Loss: 4.486374454498291, Test Loss: 4.488079309463501, LR: 0.0005, Elapsed Time: 480.90 seconds Step 20300/150000, Loss: 4.48078682422638, Test Loss: 4.483373641967773, LR: 0.0005, Elapsed Time: 483.31 seconds Step 20400/150000, Loss: 4.49204749584198, Test Loss: 4.48321533203125, LR: 0.0005, Elapsed Time: 485.72 seconds Step 20500/150000, Loss: 4.4949183797836305, Test Loss: 4.481527030467987, LR: 0.0005, Elapsed Time: 488.14 seconds Step 20600/150000, Loss: 4.476307392120361, Test Loss: 4.481425166130066, LR: 0.0005, Elapsed Time: 490.54 seconds Step 20700/150000, Loss: 4.4942142772674565, Test Loss: 4.481167137622833, LR: 0.0005, Elapsed Time: 492.95 seconds Step 20800/150000, Loss: 4.475279421806335, Test Loss: 4.478660464286804, LR: 0.0005, Elapsed Time: 495.34 seconds Step 20900/150000, Loss: 4.473546757698059, Test Loss: 4.476366579532623, LR: 0.0005, Elapsed Time: 497.75 seconds Step 21000/150000, Loss: 4.4948132133483885, Test Loss: 4.473031401634216, LR: 0.0005, Elapsed Time: 500.16 seconds Step 21100/150000, Loss: 4.486972222328186, Test Loss: 4.479911386966705, LR: 0.0005, Elapsed Time: 502.57 seconds Step 21200/150000, Loss: 4.487997083663941, Test Loss: 4.471247494220734, LR: 0.0005, Elapsed Time: 504.98 seconds Step 21300/150000, Loss: 4.487250213623047, Test Loss: 4.472346365451813, LR: 0.0005, Elapsed Time: 507.37 seconds Step 21400/150000, Loss: 4.478231444358825, Test Loss: 4.469191074371338, LR: 0.0005, Elapsed Time: 509.79 seconds Step 21500/150000, Loss: 4.495006847381592, Test Loss: 4.466927647590637, LR: 0.0005, Elapsed Time: 512.21 seconds Step 21600/150000, Loss: 4.467617635726929, Test Loss: 4.470831453800201, LR: 0.0005, Elapsed Time: 514.61 seconds Step 21700/150000, Loss: 4.482581076622009, Test Loss: 4.468084454536438, LR: 0.0005, Elapsed Time: 517.02 seconds Step 21800/150000, Loss: 4.460470933914184, Test Loss: 
4.464775919914246, LR: 0.0005, Elapsed Time: 519.42 seconds Step 21900/150000, Loss: 4.478560075759888, Test Loss: 4.467622339725494, LR: 0.0005, Elapsed Time: 521.81 seconds Step 22000/150000, Loss: 4.464054803848267, Test Loss: 4.463311851024628, LR: 0.0005, Elapsed Time: 524.07 seconds Step 22100/150000, Loss: 4.463960957527161, Test Loss: 4.460888147354126, LR: 0.0005, Elapsed Time: 526.32 seconds Step 22200/150000, Loss: 4.4710403203964235, Test Loss: 4.459406912326813, LR: 0.0005, Elapsed Time: 528.58 seconds Step 22300/150000, Loss: 4.467518124580383, Test Loss: 4.460662126541138, LR: 0.0005, Elapsed Time: 530.84 seconds Step 22400/150000, Loss: 4.471158032417297, Test Loss: 4.460574209690094, LR: 0.0005, Elapsed Time: 533.10 seconds Step 22500/150000, Loss: 4.4601910161972045, Test Loss: 4.459298193454742, LR: 0.0005, Elapsed Time: 535.35 seconds Step 22600/150000, Loss: 4.468060073852539, Test Loss: 4.459884881973267, LR: 0.0005, Elapsed Time: 537.60 seconds Step 22700/150000, Loss: 4.461251430511474, Test Loss: 4.45482325553894, LR: 0.0005, Elapsed Time: 539.87 seconds Step 22800/150000, Loss: 4.451180963516236, Test Loss: 4.457395493984222, LR: 0.0005, Elapsed Time: 542.13 seconds Step 22900/150000, Loss: 4.441122388839721, Test Loss: 4.4564467668533325, LR: 0.0005, Elapsed Time: 544.38 seconds Step 23000/150000, Loss: 4.458741893768311, Test Loss: 4.450775980949402, LR: 0.0005, Elapsed Time: 546.64 seconds Step 23100/150000, Loss: 4.446218361854553, Test Loss: 4.453252732753754, LR: 0.0005, Elapsed Time: 548.91 seconds Step 23200/150000, Loss: 4.446700835227967, Test Loss: 4.449577867984772, LR: 0.0005, Elapsed Time: 551.17 seconds Step 23300/150000, Loss: 4.459662127494812, Test Loss: 4.451347589492798, LR: 0.0005, Elapsed Time: 553.43 seconds Step 23400/150000, Loss: 4.456698393821716, Test Loss: 4.447441279888153, LR: 0.0005, Elapsed Time: 555.70 seconds Step 23500/150000, Loss: 4.446945352554321, Test Loss: 4.447851836681366, LR: 0.0005, Elapsed Time: 557.96 seconds Step 23600/150000, Loss: 4.456517810821533, Test Loss: 4.4493526220321655, LR: 0.0005, Elapsed Time: 560.22 seconds Step 23700/150000, Loss: 4.447642521858215, Test Loss: 4.448650002479553, LR: 0.0005, Elapsed Time: 562.48 seconds Step 23800/150000, Loss: 4.445041146278381, Test Loss: 4.450870215892792, LR: 0.0005, Elapsed Time: 564.77 seconds Step 23900/150000, Loss: 4.4457080364227295, Test Loss: 4.44720458984375, LR: 0.0005, Elapsed Time: 567.03 seconds Step 24000/150000, Loss: 4.450307750701905, Test Loss: 4.44503778219223, LR: 0.0005, Elapsed Time: 569.30 seconds Step 24100/150000, Loss: 4.4362435102462765, Test Loss: 4.441306471824646, LR: 0.0005, Elapsed Time: 571.57 seconds Step 24200/150000, Loss: 4.449194784164429, Test Loss: 4.442341685295105, LR: 0.0005, Elapsed Time: 573.83 seconds Step 24300/150000, Loss: 4.449483957290649, Test Loss: 4.446923017501831, LR: 0.0005, Elapsed Time: 576.09 seconds Step 24400/150000, Loss: 4.438649277687073, Test Loss: 4.442263424396515, LR: 0.0005, Elapsed Time: 578.34 seconds Step 24500/150000, Loss: 4.440170502662658, Test Loss: 4.439618647098541, LR: 0.0005, Elapsed Time: 580.59 seconds Step 24600/150000, Loss: 4.4329999780654905, Test Loss: 4.433749496936798, LR: 0.0005, Elapsed Time: 582.85 seconds Step 24700/150000, Loss: 4.430663704872131, Test Loss: 4.441073060035706, LR: 0.0005, Elapsed Time: 585.11 seconds Step 24800/150000, Loss: 4.438053326606751, Test Loss: 4.435450494289398, LR: 0.0005, Elapsed Time: 587.36 seconds Step 24900/150000, Loss: 
4.423553228378296, Test Loss: 4.431729257106781, LR: 0.0005, Elapsed Time: 589.62 seconds Step 25000/150000, Loss: 4.435029044151306, Test Loss: 4.431743741035461, LR: 0.0005, Elapsed Time: 591.90 seconds Step 25100/150000, Loss: 4.4433708047866824, Test Loss: 4.435071527957916, LR: 0.0005, Elapsed Time: 594.16 seconds Step 25200/150000, Loss: 4.437222938537598, Test Loss: 4.433805704116821, LR: 0.0005, Elapsed Time: 596.43 seconds Step 25300/150000, Loss: 4.435000338554382, Test Loss: 4.431661307811737, LR: 0.0005, Elapsed Time: 598.70 seconds Step 25400/150000, Loss: 4.440752034187317, Test Loss: 4.430679559707642, LR: 0.0005, Elapsed Time: 600.96 seconds Step 25500/150000, Loss: 4.432871408462525, Test Loss: 4.427384614944458, LR: 0.0005, Elapsed Time: 603.23 seconds Step 25600/150000, Loss: 4.432652449607849, Test Loss: 4.427745819091797, LR: 0.0005, Elapsed Time: 605.50 seconds Step 25700/150000, Loss: 4.431785607337952, Test Loss: 4.427912890911102, LR: 0.0005, Elapsed Time: 607.76 seconds Step 25800/150000, Loss: 4.4236441469192505, Test Loss: 4.426806747913361, LR: 0.0005, Elapsed Time: 610.03 seconds Step 25900/150000, Loss: 4.4386088514328, Test Loss: 4.427948415279388, LR: 0.0005, Elapsed Time: 612.29 seconds Step 26000/150000, Loss: 4.424503326416016, Test Loss: 4.4251527190208435, LR: 0.0005, Elapsed Time: 614.56 seconds Step 26100/150000, Loss: 4.418567638397217, Test Loss: 4.424237251281738, LR: 0.0005, Elapsed Time: 616.84 seconds Step 26200/150000, Loss: 4.428577823638916, Test Loss: 4.423579573631287, LR: 0.0005, Elapsed Time: 619.10 seconds Step 26300/150000, Loss: 4.419262046813965, Test Loss: 4.422359645366669, LR: 0.0005, Elapsed Time: 621.37 seconds Step 26400/150000, Loss: 4.41893328666687, Test Loss: 4.424150764942169, LR: 0.0005, Elapsed Time: 623.63 seconds Step 26500/150000, Loss: 4.429289331436157, Test Loss: 4.420376837253571, LR: 0.0005, Elapsed Time: 625.90 seconds Step 26600/150000, Loss: 4.426093044281006, Test Loss: 4.426370143890381, LR: 0.0005, Elapsed Time: 628.16 seconds Step 26700/150000, Loss: 4.424168734550476, Test Loss: 4.425789177417755, LR: 0.0005, Elapsed Time: 630.42 seconds Step 26800/150000, Loss: 4.425082349777222, Test Loss: 4.422569155693054, LR: 0.0005, Elapsed Time: 632.69 seconds Step 26900/150000, Loss: 4.411873435974121, Test Loss: 4.423566162586212, LR: 0.0005, Elapsed Time: 634.96 seconds Step 27000/150000, Loss: 4.410008502006531, Test Loss: 4.420478522777557, LR: 0.0005, Elapsed Time: 637.21 seconds Step 27100/150000, Loss: 4.41948438167572, Test Loss: 4.421168088912964, LR: 0.0005, Elapsed Time: 639.48 seconds Step 27200/150000, Loss: 4.4120335149765015, Test Loss: 4.421755135059357, LR: 0.0005, Elapsed Time: 641.74 seconds Step 27300/150000, Loss: 4.416587877273559, Test Loss: 4.4179787039756775, LR: 0.0005, Elapsed Time: 644.00 seconds Step 27400/150000, Loss: 4.410479040145874, Test Loss: 4.419560968875885, LR: 0.0005, Elapsed Time: 646.26 seconds Step 27500/150000, Loss: 4.420058522224426, Test Loss: 4.41626113653183, LR: 0.0005, Elapsed Time: 648.51 seconds Step 27600/150000, Loss: 4.411013612747192, Test Loss: 4.416729092597961, LR: 0.0005, Elapsed Time: 650.76 seconds Step 27700/150000, Loss: 4.39310619354248, Test Loss: 4.417173445224762, LR: 0.0005, Elapsed Time: 653.03 seconds Step 27800/150000, Loss: 4.412046957015991, Test Loss: 4.41639631986618, LR: 0.0005, Elapsed Time: 655.29 seconds Step 27900/150000, Loss: 4.426310815811157, Test Loss: 4.4108535051345825, LR: 0.0005, Elapsed Time: 657.55 seconds Step 
28000/150000, Loss: 4.402293486595154, Test Loss: 4.411234736442566, LR: 0.0005, Elapsed Time: 659.81 seconds Step 28100/150000, Loss: 4.411099290847778, Test Loss: 4.4097554087638855, LR: 0.0005, Elapsed Time: 662.08 seconds Step 28200/150000, Loss: 4.406350111961364, Test Loss: 4.412723124027252, LR: 0.0005, Elapsed Time: 664.34 seconds Step 28300/150000, Loss: 4.403218817710877, Test Loss: 4.414161562919617, LR: 0.0005, Elapsed Time: 666.62 seconds Step 28400/150000, Loss: 4.40811327457428, Test Loss: 4.408611536026001, LR: 0.0005, Elapsed Time: 668.88 seconds Step 28500/150000, Loss: 4.416187644004822, Test Loss: 4.409826397895813, LR: 0.0005, Elapsed Time: 671.13 seconds Step 28600/150000, Loss: 4.402096080780029, Test Loss: 4.408366680145264, LR: 0.0005, Elapsed Time: 673.39 seconds Step 28700/150000, Loss: 4.396033329963684, Test Loss: 4.409279406070709, LR: 0.0005, Elapsed Time: 675.65 seconds Step 28800/150000, Loss: 4.399459547996521, Test Loss: 4.40605354309082, LR: 0.0005, Elapsed Time: 677.91 seconds Step 28900/150000, Loss: 4.400590462684631, Test Loss: 4.407582402229309, LR: 0.0005, Elapsed Time: 680.17 seconds Step 29000/150000, Loss: 4.391732535362244, Test Loss: 4.407607555389404, LR: 0.0005, Elapsed Time: 682.42 seconds Step 29100/150000, Loss: 4.400592665672303, Test Loss: 4.405213356018066, LR: 0.0005, Elapsed Time: 684.68 seconds Step 29200/150000, Loss: 4.388068222999573, Test Loss: 4.404130399227142, LR: 0.0005, Elapsed Time: 686.94 seconds Step 29300/150000, Loss: 4.410907397270202, Test Loss: 4.408651649951935, LR: 0.0005, Elapsed Time: 689.19 seconds Step 29400/150000, Loss: 4.387275414466858, Test Loss: 4.404370546340942, LR: 0.0005, Elapsed Time: 691.45 seconds Step 29500/150000, Loss: 4.39313512802124, Test Loss: 4.4018683433532715, LR: 0.0005, Elapsed Time: 693.70 seconds Step 29600/150000, Loss: 4.402139592170715, Test Loss: 4.400742292404175, LR: 0.0005, Elapsed Time: 695.96 seconds Step 29700/150000, Loss: 4.381916408538818, Test Loss: 4.397399485111237, LR: 0.0005, Elapsed Time: 698.21 seconds Step 29800/150000, Loss: 4.399872727394104, Test Loss: 4.392035067081451, LR: 0.0005, Elapsed Time: 700.46 seconds Step 29900/150000, Loss: 4.382444906234741, Test Loss: 4.3972790241241455, LR: 0.0005, Elapsed Time: 702.74 seconds Step 30000/150000, Loss: 4.393869862556458, Test Loss: 4.395161330699921, LR: 0.0005, Elapsed Time: 705.00 seconds Step 30100/150000, Loss: 4.3928053665161135, Test Loss: 4.396532595157623, LR: 0.0005, Elapsed Time: 707.27 seconds Step 30200/150000, Loss: 4.396444973945617, Test Loss: 4.391043245792389, LR: 0.0005, Elapsed Time: 709.53 seconds Step 30300/150000, Loss: 4.3898271512985225, Test Loss: 4.395740985870361, LR: 0.0005, Elapsed Time: 711.79 seconds Step 30400/150000, Loss: 4.393435115814209, Test Loss: 4.392382740974426, LR: 0.0005, Elapsed Time: 714.04 seconds Step 30500/150000, Loss: 4.384532475471497, Test Loss: 4.392955183982849, LR: 0.0005, Elapsed Time: 716.30 seconds Step 30600/150000, Loss: 4.3827527904510495, Test Loss: 4.392284631729126, LR: 0.0005, Elapsed Time: 718.55 seconds Step 30700/150000, Loss: 4.384660849571228, Test Loss: 4.391201198101044, LR: 0.0005, Elapsed Time: 720.81 seconds Step 30800/150000, Loss: 4.38840916633606, Test Loss: 4.3901514410972595, LR: 0.0005, Elapsed Time: 723.06 seconds Step 30900/150000, Loss: 4.387351589202881, Test Loss: 4.38919723033905, LR: 0.0005, Elapsed Time: 725.31 seconds Step 31000/150000, Loss: 4.383817443847656, Test Loss: 4.392574727535248, LR: 0.0005, Elapsed Time: 727.58 
seconds Step 31100/150000, Loss: 4.391881256103516, Test Loss: 4.389953255653381, LR: 0.0005, Elapsed Time: 729.86 seconds Step 31200/150000, Loss: 4.387584595680237, Test Loss: 4.389747798442841, LR: 0.0005, Elapsed Time: 732.13 seconds Step 31300/150000, Loss: 4.372762093544006, Test Loss: 4.388463377952576, LR: 0.0005, Elapsed Time: 734.40 seconds Step 31400/150000, Loss: 4.373998460769653, Test Loss: 4.383456707000732, LR: 0.0005, Elapsed Time: 736.67 seconds Step 31500/150000, Loss: 4.383644113540649, Test Loss: 4.387653470039368, LR: 0.0005, Elapsed Time: 738.93 seconds Step 31600/150000, Loss: 4.382969341278076, Test Loss: 4.384958326816559, LR: 0.0005, Elapsed Time: 741.18 seconds Step 31700/150000, Loss: 4.375899558067322, Test Loss: 4.385817348957062, LR: 0.0005, Elapsed Time: 743.43 seconds Step 31800/150000, Loss: 4.374554219245911, Test Loss: 4.3795347809791565, LR: 0.0005, Elapsed Time: 745.69 seconds Step 31900/150000, Loss: 4.37382342338562, Test Loss: 4.383820116519928, LR: 0.0005, Elapsed Time: 747.95 seconds Step 32000/150000, Loss: 4.387607488632202, Test Loss: 4.379450142383575, LR: 0.0005, Elapsed Time: 750.22 seconds Step 32100/150000, Loss: 4.380309562683106, Test Loss: 4.388272702693939, LR: 0.0005, Elapsed Time: 752.48 seconds Step 32200/150000, Loss: 4.377749671936035, Test Loss: 4.379946351051331, LR: 0.0005, Elapsed Time: 754.73 seconds Step 32300/150000, Loss: 4.387304334640503, Test Loss: 4.380756139755249, LR: 0.0005, Elapsed Time: 756.99 seconds Step 32400/150000, Loss: 4.377603487968445, Test Loss: 4.377735614776611, LR: 0.0005, Elapsed Time: 759.25 seconds Step 32500/150000, Loss: 4.380426626205445, Test Loss: 4.374357581138611, LR: 0.0005, Elapsed Time: 761.51 seconds Step 32600/150000, Loss: 4.375449819564819, Test Loss: 4.377858996391296, LR: 0.0005, Elapsed Time: 763.76 seconds Step 32700/150000, Loss: 4.380830783843994, Test Loss: 4.378900170326233, LR: 0.0005, Elapsed Time: 766.02 seconds Step 32800/150000, Loss: 4.3901441860198975, Test Loss: 4.375666558742523, LR: 0.0005, Elapsed Time: 768.27 seconds Step 32900/150000, Loss: 4.372995648384094, Test Loss: 4.3753708600997925, LR: 0.0005, Elapsed Time: 770.53 seconds Step 33000/150000, Loss: 4.373090491294861, Test Loss: 4.377339065074921, LR: 0.0005, Elapsed Time: 772.79 seconds Step 33100/150000, Loss: 4.361830382347107, Test Loss: 4.376699388027191, LR: 0.0005, Elapsed Time: 775.05 seconds Step 33200/150000, Loss: 4.376027178764343, Test Loss: 4.376384794712067, LR: 0.0005, Elapsed Time: 777.30 seconds Step 33300/150000, Loss: 4.371391725540161, Test Loss: 4.376210808753967, LR: 0.0005, Elapsed Time: 779.56 seconds Step 33400/150000, Loss: 4.375049495697022, Test Loss: 4.374938786029816, LR: 0.0005, Elapsed Time: 781.82 seconds Step 33500/150000, Loss: 4.3732981824874875, Test Loss: 4.3742517828941345, LR: 0.0005, Elapsed Time: 784.08 seconds Step 33600/150000, Loss: 4.374722266197205, Test Loss: 4.373788356781006, LR: 0.0005, Elapsed Time: 786.34 seconds Step 33700/150000, Loss: 4.365082569122315, Test Loss: 4.374330461025238, LR: 0.0005, Elapsed Time: 788.60 seconds Step 33800/150000, Loss: 4.3672421169281, Test Loss: 4.377809166908264, LR: 0.0005, Elapsed Time: 790.86 seconds Step 33900/150000, Loss: 4.374629430770874, Test Loss: 4.371380150318146, LR: 0.0005, Elapsed Time: 793.13 seconds Step 34000/150000, Loss: 4.374588966369629, Test Loss: 4.370753049850464, LR: 0.0005, Elapsed Time: 795.38 seconds Step 34100/150000, Loss: 4.366230864524841, Test Loss: 4.371456027030945, LR: 0.0005, Elapsed 
Time: 797.64 seconds Step 34200/150000, Loss: 4.379716572761535, Test Loss: 4.369409501552582, LR: 0.0005, Elapsed Time: 799.90 seconds Step 34300/150000, Loss: 4.361826810836792, Test Loss: 4.367761611938477, LR: 0.0005, Elapsed Time: 802.16 seconds Step 34400/150000, Loss: 4.368473272323609, Test Loss: 4.367554426193237, LR: 0.0005, Elapsed Time: 804.42 seconds Step 34500/150000, Loss: 4.370284457206726, Test Loss: 4.3722087144851685, LR: 0.0005, Elapsed Time: 806.68 seconds Step 34600/150000, Loss: 4.359614334106445, Test Loss: 4.367729961872101, LR: 0.0005, Elapsed Time: 808.94 seconds Step 34700/150000, Loss: 4.363624267578125, Test Loss: 4.366275787353516, LR: 0.0005, Elapsed Time: 811.20 seconds Step 34800/150000, Loss: 4.365877966880799, Test Loss: 4.364705145359039, LR: 0.0005, Elapsed Time: 813.46 seconds Step 34900/150000, Loss: 4.361302971839905, Test Loss: 4.3635223507881165, LR: 0.0005, Elapsed Time: 815.72 seconds Step 35000/150000, Loss: 4.352422504425049, Test Loss: 4.364431381225586, LR: 0.0005, Elapsed Time: 817.98 seconds Step 35100/150000, Loss: 4.36617959022522, Test Loss: 4.36428302526474, LR: 0.0005, Elapsed Time: 820.23 seconds Step 35200/150000, Loss: 4.370190935134888, Test Loss: 4.361592948436737, LR: 0.0005, Elapsed Time: 822.48 seconds Step 35300/150000, Loss: 4.3569950675964355, Test Loss: 4.3648770451545715, LR: 0.0005, Elapsed Time: 824.74 seconds Step 35400/150000, Loss: 4.353726668357849, Test Loss: 4.362483501434326, LR: 0.0005, Elapsed Time: 826.99 seconds Step 35500/150000, Loss: 4.351029553413391, Test Loss: 4.358498394489288, LR: 0.0005, Elapsed Time: 829.25 seconds Step 35600/150000, Loss: 4.360661001205444, Test Loss: 4.362089097499847, LR: 0.0005, Elapsed Time: 831.51 seconds Step 35700/150000, Loss: 4.3541097784042355, Test Loss: 4.3572511076927185, LR: 0.0005, Elapsed Time: 833.77 seconds Step 35800/150000, Loss: 4.3574042224884035, Test Loss: 4.3599361181259155, LR: 0.0005, Elapsed Time: 836.02 seconds Step 35900/150000, Loss: 4.364496903419495, Test Loss: 4.357215940952301, LR: 0.0005, Elapsed Time: 838.27 seconds Step 36000/150000, Loss: 4.352040510177613, Test Loss: 4.359864771366119, LR: 0.0005, Elapsed Time: 840.53 seconds Step 36100/150000, Loss: 4.35408579826355, Test Loss: 4.359675645828247, LR: 0.0005, Elapsed Time: 842.78 seconds Step 36200/150000, Loss: 4.358658604621887, Test Loss: 4.360761523246765, LR: 0.0005, Elapsed Time: 845.04 seconds Step 36300/150000, Loss: 4.360071496963501, Test Loss: 4.358421444892883, LR: 0.0005, Elapsed Time: 847.29 seconds Step 36400/150000, Loss: 4.347355356216431, Test Loss: 4.358124673366547, LR: 0.0005, Elapsed Time: 849.55 seconds Step 36500/150000, Loss: 4.353447651863098, Test Loss: 4.35642808675766, LR: 0.0005, Elapsed Time: 851.81 seconds Step 36600/150000, Loss: 4.341551904678345, Test Loss: 4.35664302110672, LR: 0.0005, Elapsed Time: 854.07 seconds Step 36700/150000, Loss: 4.34647424697876, Test Loss: 4.355391681194305, LR: 0.0005, Elapsed Time: 856.33 seconds Step 36800/150000, Loss: 4.355445990562439, Test Loss: 4.354495346546173, LR: 0.0005, Elapsed Time: 858.58 seconds Step 36900/150000, Loss: 4.348022379875183, Test Loss: 4.356687784194946, LR: 0.0005, Elapsed Time: 860.85 seconds Step 37000/150000, Loss: 4.348324241638184, Test Loss: 4.355081081390381, LR: 0.0005, Elapsed Time: 863.10 seconds Step 37100/150000, Loss: 4.350793232917786, Test Loss: 4.354921996593475, LR: 0.0005, Elapsed Time: 865.36 seconds Step 37200/150000, Loss: 4.33569522857666, Test Loss: 4.354163229465485, LR: 
0.0005, Elapsed Time: 867.61 seconds Step 37300/150000, Loss: 4.350907073020935, Test Loss: 4.351807236671448, LR: 0.0005, Elapsed Time: 869.87 seconds Step 37400/150000, Loss: 4.354293003082275, Test Loss: 4.348916828632355, LR: 0.0005, Elapsed Time: 872.13 seconds Step 37500/150000, Loss: 4.337583327293396, Test Loss: 4.351940929889679, LR: 0.0005, Elapsed Time: 874.38 seconds Step 37600/150000, Loss: 4.357893643379211, Test Loss: 4.353776633739471, LR: 0.0005, Elapsed Time: 876.63 seconds Step 37700/150000, Loss: 4.346215391159058, Test Loss: 4.350262761116028, LR: 0.0005, Elapsed Time: 878.90 seconds Step 37800/150000, Loss: 4.33481520652771, Test Loss: 4.352648735046387, LR: 0.0005, Elapsed Time: 881.16 seconds Step 37900/150000, Loss: 4.333804082870484, Test Loss: 4.3532180190086365, LR: 0.0005, Elapsed Time: 883.41 seconds Step 38000/150000, Loss: 4.351818733215332, Test Loss: 4.349226415157318, LR: 0.0005, Elapsed Time: 885.67 seconds Step 38100/150000, Loss: 4.339835095405578, Test Loss: 4.348971486091614, LR: 0.0005, Elapsed Time: 887.93 seconds Step 38200/150000, Loss: 4.347683038711548, Test Loss: 4.347927451133728, LR: 0.0005, Elapsed Time: 890.20 seconds Step 38300/150000, Loss: 4.3447941875457765, Test Loss: 4.343221724033356, LR: 0.0005, Elapsed Time: 892.45 seconds Step 38400/150000, Loss: 4.340312194824219, Test Loss: 4.340498566627502, LR: 0.0005, Elapsed Time: 894.72 seconds Step 38500/150000, Loss: 4.349240369796753, Test Loss: 4.340567767620087, LR: 0.0005, Elapsed Time: 896.97 seconds Step 38600/150000, Loss: 4.34198830127716, Test Loss: 4.342031240463257, LR: 0.0005, Elapsed Time: 899.22 seconds Step 38700/150000, Loss: 4.3499065160751345, Test Loss: 4.343287229537964, LR: 0.0005, Elapsed Time: 901.48 seconds Step 38800/150000, Loss: 4.334733047485352, Test Loss: 4.342368304729462, LR: 0.0005, Elapsed Time: 903.73 seconds Step 38900/150000, Loss: 4.331156539916992, Test Loss: 4.342459321022034, LR: 0.0005, Elapsed Time: 905.99 seconds Step 39000/150000, Loss: 4.345631823539734, Test Loss: 4.344482004642487, LR: 0.0005, Elapsed Time: 908.24 seconds Step 39100/150000, Loss: 4.336893978118897, Test Loss: 4.343028366565704, LR: 0.0005, Elapsed Time: 910.51 seconds Step 39200/150000, Loss: 4.3492778778076175, Test Loss: 4.339098334312439, LR: 0.0005, Elapsed Time: 912.77 seconds Step 39300/150000, Loss: 4.3380689573287965, Test Loss: 4.344177007675171, LR: 0.0005, Elapsed Time: 915.03 seconds Step 39400/150000, Loss: 4.320622668266297, Test Loss: 4.339345216751099, LR: 0.0005, Elapsed Time: 917.29 seconds Step 39500/150000, Loss: 4.32775728225708, Test Loss: 4.341570615768433, LR: 0.0005, Elapsed Time: 919.56 seconds Step 39600/150000, Loss: 4.338237323760986, Test Loss: 4.339925050735474, LR: 0.0005, Elapsed Time: 921.82 seconds Step 39700/150000, Loss: 4.344900646209717, Test Loss: 4.340912103652954, LR: 0.0005, Elapsed Time: 924.08 seconds Step 39800/150000, Loss: 4.33304072856903, Test Loss: 4.3389071226119995, LR: 0.0005, Elapsed Time: 926.34 seconds Step 39900/150000, Loss: 4.339319353103638, Test Loss: 4.339995861053467, LR: 0.0005, Elapsed Time: 928.59 seconds Step 40000/150000, Loss: 4.331869478225708, Test Loss: 4.339958131313324, LR: 0.0005, Elapsed Time: 930.85 seconds Step 40100/150000, Loss: 4.331388268470764, Test Loss: 4.3388766050338745, LR: 0.0005, Elapsed Time: 933.11 seconds Step 40200/150000, Loss: 4.324614334106445, Test Loss: 4.332013368606567, LR: 0.0005, Elapsed Time: 935.37 seconds Step 40300/150000, Loss: 4.331877217292786, Test Loss: 
4.335694432258606, LR: 0.0005, Elapsed Time: 937.63 seconds Step 40400/150000, Loss: 4.341790790557861, Test Loss: 4.332803547382355, LR: 0.0005, Elapsed Time: 939.90 seconds Step 40500/150000, Loss: 4.340521183013916, Test Loss: 4.333991527557373, LR: 0.0005, Elapsed Time: 942.17 seconds Step 40600/150000, Loss: 4.332984075546265, Test Loss: 4.330583870410919, LR: 0.0005, Elapsed Time: 944.43 seconds Step 40700/150000, Loss: 4.339196267127991, Test Loss: 4.334278106689453, LR: 0.0005, Elapsed Time: 946.69 seconds Step 40800/150000, Loss: 4.3330592441558835, Test Loss: 4.331490576267242, LR: 0.0005, Elapsed Time: 948.94 seconds Step 40900/150000, Loss: 4.3247833251953125, Test Loss: 4.336824655532837, LR: 0.0005, Elapsed Time: 951.20 seconds Step 41000/150000, Loss: 4.3283611059188845, Test Loss: 4.333289921283722, LR: 0.0005, Elapsed Time: 953.46 seconds Step 41100/150000, Loss: 4.325459513664246, Test Loss: 4.331052601337433, LR: 0.0005, Elapsed Time: 955.72 seconds Step 41200/150000, Loss: 4.32569839477539, Test Loss: 4.333673655986786, LR: 0.0005, Elapsed Time: 957.98 seconds Step 41300/150000, Loss: 4.328653841018677, Test Loss: 4.333842694759369, LR: 0.0005, Elapsed Time: 960.24 seconds Step 41400/150000, Loss: 4.327797307968139, Test Loss: 4.33536958694458, LR: 0.0005, Elapsed Time: 962.50 seconds Step 41500/150000, Loss: 4.322942748069763, Test Loss: 4.333994209766388, LR: 0.0005, Elapsed Time: 964.76 seconds Step 41600/150000, Loss: 4.326237897872925, Test Loss: 4.331892788410187, LR: 0.0005, Elapsed Time: 967.02 seconds Step 41700/150000, Loss: 4.327303881645203, Test Loss: 4.329919993877411, LR: 0.0005, Elapsed Time: 969.28 seconds Step 41800/150000, Loss: 4.329608259201049, Test Loss: 4.328213632106781, LR: 0.0005, Elapsed Time: 971.55 seconds Step 41900/150000, Loss: 4.334939794540405, Test Loss: 4.330727398395538, LR: 0.0005, Elapsed Time: 973.81 seconds Step 42000/150000, Loss: 4.318497018814087, Test Loss: 4.32933121919632, LR: 0.0005, Elapsed Time: 976.07 seconds Step 42100/150000, Loss: 4.317577996253967, Test Loss: 4.329457461833954, LR: 0.0005, Elapsed Time: 978.33 seconds Step 42200/150000, Loss: 4.325277667045594, Test Loss: 4.330113470554352, LR: 0.0005, Elapsed Time: 980.60 seconds Step 42300/150000, Loss: 4.321885957717895, Test Loss: 4.328914165496826, LR: 0.0005, Elapsed Time: 982.86 seconds Step 42400/150000, Loss: 4.32630690574646, Test Loss: 4.330440580844879, LR: 0.0005, Elapsed Time: 985.12 seconds Step 42500/150000, Loss: 4.316565232276917, Test Loss: 4.326734244823456, LR: 0.0005, Elapsed Time: 987.38 seconds Step 42600/150000, Loss: 4.32872049331665, Test Loss: 4.326007187366486, LR: 0.0005, Elapsed Time: 989.64 seconds Step 42700/150000, Loss: 4.315379729270935, Test Loss: 4.331083834171295, LR: 0.0005, Elapsed Time: 991.91 seconds Step 42800/150000, Loss: 4.316136598587036, Test Loss: 4.32831746339798, LR: 0.0005, Elapsed Time: 994.16 seconds Step 42900/150000, Loss: 4.3366562938690185, Test Loss: 4.327862918376923, LR: 0.0005, Elapsed Time: 996.41 seconds Step 43000/150000, Loss: 4.321315011978149, Test Loss: 4.324210584163666, LR: 0.0005, Elapsed Time: 998.67 seconds Step 43100/150000, Loss: 4.331274108886719, Test Loss: 4.323949575424194, LR: 0.0005, Elapsed Time: 1000.94 seconds Step 43200/150000, Loss: 4.325090909004212, Test Loss: 4.323664605617523, LR: 0.0005, Elapsed Time: 1003.20 seconds Step 43300/150000, Loss: 4.3239258813858035, Test Loss: 4.326116740703583, LR: 0.0005, Elapsed Time: 1005.46 seconds Step 43400/150000, Loss: 
4.326781797409057, Test Loss: 4.32248067855835, LR: 0.0005, Elapsed Time: 1007.72 seconds Step 43500/150000, Loss: 4.315217170715332, Test Loss: 4.3232086300849915, LR: 0.0005, Elapsed Time: 1009.99 seconds Step 43600/150000, Loss: 4.322398104667664, Test Loss: 4.3230814933776855, LR: 0.0005, Elapsed Time: 1012.25 seconds Step 43700/150000, Loss: 4.317909164428711, Test Loss: 4.323676407337189, LR: 0.0005, Elapsed Time: 1014.52 seconds Step 43800/150000, Loss: 4.318664588928223, Test Loss: 4.319726526737213, LR: 0.0005, Elapsed Time: 1016.79 seconds Step 43900/150000, Loss: 4.31348117351532, Test Loss: 4.319318115711212, LR: 0.0005, Elapsed Time: 1019.05 seconds Step 44000/150000, Loss: 4.315704684257508, Test Loss: 4.320447027683258, LR: 0.0005, Elapsed Time: 1021.31 seconds Step 44100/150000, Loss: 4.311765160560608, Test Loss: 4.319638609886169, LR: 0.0005, Elapsed Time: 1023.57 seconds Step 44200/150000, Loss: 4.3189832305908205, Test Loss: 4.317598223686218, LR: 0.0005, Elapsed Time: 1025.84 seconds Step 44300/150000, Loss: 4.313728837966919, Test Loss: 4.317281186580658, LR: 0.0005, Elapsed Time: 1028.10 seconds Step 44400/150000, Loss: 4.316393675804139, Test Loss: 4.318271517753601, LR: 0.0005, Elapsed Time: 1030.36 seconds Step 44500/150000, Loss: 4.317489829063415, Test Loss: 4.318049609661102, LR: 0.0005, Elapsed Time: 1032.62 seconds Step 44600/150000, Loss: 4.312700819969177, Test Loss: 4.3174044489860535, LR: 0.0005, Elapsed Time: 1034.88 seconds Step 44700/150000, Loss: 4.307196373939514, Test Loss: 4.318657040596008, LR: 0.0005, Elapsed Time: 1037.14 seconds Step 44800/150000, Loss: 4.292447304725647, Test Loss: 4.3172966837883, LR: 0.0005, Elapsed Time: 1039.40 seconds Step 44900/150000, Loss: 4.312977690696716, Test Loss: 4.316750347614288, LR: 0.0005, Elapsed Time: 1041.66 seconds Step 45000/150000, Loss: 4.307983584403992, Test Loss: 4.318207144737244, LR: 0.0005, Elapsed Time: 1043.93 seconds Step 45100/150000, Loss: 4.300743479728698, Test Loss: 4.316190123558044, LR: 0.0005, Elapsed Time: 1046.19 seconds Step 45200/150000, Loss: 4.318858017921448, Test Loss: 4.316973865032196, LR: 0.0005, Elapsed Time: 1048.45 seconds Step 45300/150000, Loss: 4.307809929847718, Test Loss: 4.313174486160278, LR: 0.0005, Elapsed Time: 1050.71 seconds Step 45400/150000, Loss: 4.313835868835449, Test Loss: 4.314944744110107, LR: 0.0005, Elapsed Time: 1052.97 seconds Step 45500/150000, Loss: 4.315958552360534, Test Loss: 4.315226852893829, LR: 0.0005, Elapsed Time: 1055.23 seconds Step 45600/150000, Loss: 4.303682713508606, Test Loss: 4.315033733844757, LR: 0.0005, Elapsed Time: 1057.48 seconds Step 45700/150000, Loss: 4.298544964790344, Test Loss: 4.316842675209045, LR: 0.0005, Elapsed Time: 1059.73 seconds Step 45800/150000, Loss: 4.317230377197266, Test Loss: 4.314151167869568, LR: 0.0005, Elapsed Time: 1061.99 seconds Step 45900/150000, Loss: 4.305426940917969, Test Loss: 4.312131762504578, LR: 0.0005, Elapsed Time: 1064.25 seconds Step 46000/150000, Loss: 4.303668761253357, Test Loss: 4.31278783082962, LR: 0.0005, Elapsed Time: 1066.50 seconds Step 46100/150000, Loss: 4.3128831624984745, Test Loss: 4.314027309417725, LR: 0.0005, Elapsed Time: 1068.76 seconds Step 46200/150000, Loss: 4.307659468650818, Test Loss: 4.31465870141983, LR: 0.0005, Elapsed Time: 1071.04 seconds Step 46300/150000, Loss: 4.302728533744812, Test Loss: 4.315938174724579, LR: 0.0005, Elapsed Time: 1073.29 seconds Step 46400/150000, Loss: 4.302399311065674, Test Loss: 4.311546146869659, LR: 0.0005, Elapsed Time: 
1075.56 seconds Step 46500/150000, Loss: 4.299297099113464, Test Loss: 4.312092185020447, LR: 0.0005, Elapsed Time: 1077.82 seconds Step 46600/150000, Loss: 4.298504881858825, Test Loss: 4.312967240810394, LR: 0.0005, Elapsed Time: 1080.08 seconds Step 46700/150000, Loss: 4.2970996499061584, Test Loss: 4.3109259605407715, LR: 0.0005, Elapsed Time: 1082.34 seconds Step 46800/150000, Loss: 4.295693879127502, Test Loss: 4.308514058589935, LR: 0.0005, Elapsed Time: 1084.60 seconds Step 46900/150000, Loss: 4.30417311668396, Test Loss: 4.307890772819519, LR: 0.0005, Elapsed Time: 1086.86 seconds Step 47000/150000, Loss: 4.3118449354171755, Test Loss: 4.3115445375442505, LR: 0.0005, Elapsed Time: 1089.12 seconds Step 47100/150000, Loss: 4.304867763519287, Test Loss: 4.311034619808197, LR: 0.0005, Elapsed Time: 1091.38 seconds Step 47200/150000, Loss: 4.311420335769653, Test Loss: 4.309537410736084, LR: 0.0005, Elapsed Time: 1093.63 seconds Step 47300/150000, Loss: 4.303043909072876, Test Loss: 4.309673309326172, LR: 0.0005, Elapsed Time: 1095.89 seconds Step 47400/150000, Loss: 4.300684313774109, Test Loss: 4.30802047252655, LR: 0.0005, Elapsed Time: 1098.15 seconds Step 47500/150000, Loss: 4.303783621788025, Test Loss: 4.307301819324493, LR: 0.0005, Elapsed Time: 1100.40 seconds Step 47600/150000, Loss: 4.309947166442871, Test Loss: 4.309576570987701, LR: 0.0005, Elapsed Time: 1102.67 seconds Step 47700/150000, Loss: 4.297784986495972, Test Loss: 4.305806517601013, LR: 0.0005, Elapsed Time: 1104.93 seconds Step 47800/150000, Loss: 4.304680414199829, Test Loss: 4.306246221065521, LR: 0.0005, Elapsed Time: 1107.18 seconds Step 47900/150000, Loss: 4.295728740692138, Test Loss: 4.304622054100037, LR: 0.0005, Elapsed Time: 1109.45 seconds Step 48000/150000, Loss: 4.293417520523072, Test Loss: 4.306078851222992, LR: 0.0005, Elapsed Time: 1111.71 seconds Step 48100/150000, Loss: 4.306869530677796, Test Loss: 4.305638015270233, LR: 0.0005, Elapsed Time: 1113.97 seconds Step 48200/150000, Loss: 4.288901648521423, Test Loss: 4.308156251907349, LR: 0.0005, Elapsed Time: 1116.23 seconds Step 48300/150000, Loss: 4.303158226013184, Test Loss: 4.30575168132782, LR: 0.0005, Elapsed Time: 1118.49 seconds Step 48400/150000, Loss: 4.296811404228211, Test Loss: 4.3050936460494995, LR: 0.0005, Elapsed Time: 1120.76 seconds Step 48500/150000, Loss: 4.300083947181702, Test Loss: 4.306912302970886, LR: 0.0005, Elapsed Time: 1123.01 seconds Step 48600/150000, Loss: 4.300790305137634, Test Loss: 4.307137668132782, LR: 0.0005, Elapsed Time: 1125.27 seconds Step 48700/150000, Loss: 4.3039186000823975, Test Loss: 4.30562287569046, LR: 0.0005, Elapsed Time: 1127.53 seconds Step 48800/150000, Loss: 4.2847586297988896, Test Loss: 4.302871465682983, LR: 0.0005, Elapsed Time: 1129.79 seconds Step 48900/150000, Loss: 4.29343433380127, Test Loss: 4.304650902748108, LR: 0.0005, Elapsed Time: 1132.06 seconds Step 49000/150000, Loss: 4.294832220077515, Test Loss: 4.304498374462128, LR: 0.0005, Elapsed Time: 1134.31 seconds Step 49100/150000, Loss: 4.292713112831116, Test Loss: 4.305823385715485, LR: 0.0005, Elapsed Time: 1136.58 seconds Step 49200/150000, Loss: 4.299356880187989, Test Loss: 4.305614709854126, LR: 0.0005, Elapsed Time: 1138.84 seconds Step 49300/150000, Loss: 4.293326869010925, Test Loss: 4.304232716560364, LR: 0.0005, Elapsed Time: 1141.10 seconds Step 49400/150000, Loss: 4.289664545059204, Test Loss: 4.304398357868195, LR: 0.0005, Elapsed Time: 1143.36 seconds Step 49500/150000, Loss: 4.296453142166138, Test Loss: 
4.306224703788757, LR: 0.0005, Elapsed Time: 1145.63 seconds Step 49600/150000, Loss: 4.283649091720581, Test Loss: 4.30394184589386, LR: 0.0005, Elapsed Time: 1147.90 seconds Step 49700/150000, Loss: 4.297687578201294, Test Loss: 4.305764853954315, LR: 0.0005, Elapsed Time: 1150.16 seconds Step 49800/150000, Loss: 4.302246584892273, Test Loss: 4.303353011608124, LR: 0.0005, Elapsed Time: 1152.42 seconds Step 49900/150000, Loss: 4.287914881706238, Test Loss: 4.300215542316437, LR: 0.0005, Elapsed Time: 1154.68 seconds Step 50000/150000, Loss: 4.286743950843811, Test Loss: 4.300099670886993, LR: 0.0005, Elapsed Time: 1156.95 seconds Saving model checkpoint at step 50000 Step 50100/150000, Loss: 4.294740209579468, Test Loss: 4.300348341464996, LR: 0.0005, Elapsed Time: 1159.29 seconds Step 50200/150000, Loss: 4.28968912601471, Test Loss: 4.302231550216675, LR: 0.0005, Elapsed Time: 1161.56 seconds Step 50300/150000, Loss: 4.2990798330307, Test Loss: 4.300710141658783, LR: 0.0005, Elapsed Time: 1163.81 seconds Step 50400/150000, Loss: 4.2895826244354245, Test Loss: 4.299021005630493, LR: 0.0005, Elapsed Time: 1166.07 seconds Step 50500/150000, Loss: 4.285569505691528, Test Loss: 4.29866749048233, LR: 0.0005, Elapsed Time: 1168.32 seconds Step 50600/150000, Loss: 4.2847246742248535, Test Loss: 4.299402058124542, LR: 0.0005, Elapsed Time: 1170.58 seconds Step 50700/150000, Loss: 4.286179513931274, Test Loss: 4.295972943305969, LR: 0.0005, Elapsed Time: 1172.84 seconds Step 50800/150000, Loss: 4.285420761108399, Test Loss: 4.297976493835449, LR: 0.0005, Elapsed Time: 1175.10 seconds Step 50900/150000, Loss: 4.284252314567566, Test Loss: 4.3009761571884155, LR: 0.0005, Elapsed Time: 1177.36 seconds Step 51000/150000, Loss: 4.284822101593018, Test Loss: 4.29770565032959, LR: 0.0005, Elapsed Time: 1179.60 seconds Step 51100/150000, Loss: 4.274978652000427, Test Loss: 4.29769504070282, LR: 0.0005, Elapsed Time: 1181.86 seconds Step 51200/150000, Loss: 4.298515391349793, Test Loss: 4.299999713897705, LR: 0.0005, Elapsed Time: 1184.11 seconds Step 51300/150000, Loss: 4.276268835067749, Test Loss: 4.298844516277313, LR: 0.0005, Elapsed Time: 1186.36 seconds Step 51400/150000, Loss: 4.28612368106842, Test Loss: 4.2958661913871765, LR: 0.0005, Elapsed Time: 1188.61 seconds Step 51500/150000, Loss: 4.289710254669189, Test Loss: 4.29628449678421, LR: 0.0005, Elapsed Time: 1190.86 seconds Step 51600/150000, Loss: 4.276264214515686, Test Loss: 4.293637216091156, LR: 0.0005, Elapsed Time: 1193.11 seconds Step 51700/150000, Loss: 4.282319388389587, Test Loss: 4.293144941329956, LR: 0.0005, Elapsed Time: 1195.37 seconds Step 51800/150000, Loss: 4.276028394699097, Test Loss: 4.296373128890991, LR: 0.0005, Elapsed Time: 1197.63 seconds Step 51900/150000, Loss: 4.286621956825257, Test Loss: 4.296044588088989, LR: 0.0005, Elapsed Time: 1199.89 seconds Step 52000/150000, Loss: 4.281609315872192, Test Loss: 4.29373174905777, LR: 0.0005, Elapsed Time: 1202.14 seconds Step 52100/150000, Loss: 4.2929958248138425, Test Loss: 4.293322920799255, LR: 0.0005, Elapsed Time: 1204.39 seconds Step 52200/150000, Loss: 4.2763641691207885, Test Loss: 4.292064666748047, LR: 0.0005, Elapsed Time: 1206.65 seconds Step 52300/150000, Loss: 4.2873149061203, Test Loss: 4.2942036390304565, LR: 0.0005, Elapsed Time: 1208.91 seconds Step 52400/150000, Loss: 4.27870436668396, Test Loss: 4.293150901794434, LR: 0.0005, Elapsed Time: 1211.17 seconds Step 52500/150000, Loss: 4.27475821018219, Test Loss: 4.294040203094482, LR: 0.0005, Elapsed 
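The "Saving model checkpoint" line marks the loop serializing the model and optimizer state so an interrupted run can be resumed. As a minimal sketch of what that step can look like in pytorch (the stand-in model, filename, and dictionary keys here are illustrative assumptions, not necessarily what this notebook's training loop uses):

import torch

# Stand-ins for the GPT model and optimizer built earlier in the notebook.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
step = 50_000

# Persist everything needed to resume the run from this exact step.
torch.save(
    {
        "step": step,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    f"checkpoint_{step}.pt",
)

# Restoring later is the mirror image.
checkpoint = torch.load(f"checkpoint_{step}.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
step = checkpoint["step"]

Saving the optimizer state alongside the weights matters because Adam-style optimizers keep per-parameter moment estimates; resuming from the weights alone would effectively restart the optimizer from scratch.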
Training continues past the checkpoint, and the learning rate column starts to change:

Step 55800/150000, Loss: 4.26105140209198, Test Loss: 4.256367921829224, LR: 0.00015, Elapsed Time: 1287.92 seconds
Step 60000/150000, Loss: 4.200174951553345, Test Loss: 4.218112945556641, LR: 0.00015, Elapsed Time: 1384.71 seconds
Step 63600/150000, Loss: 4.1937617540359495, Test Loss: 4.205797553062439, LR: 4.4999999999999996e-05, Elapsed Time: 1468.07 seconds
Step 65000/150000, Loss: 4.195247287750244, Test Loss: 4.196508586406708, LR: 4.4999999999999996e-05, Elapsed Time: 1500.00 seconds
seconds Step 67300/150000, Loss: 4.174054827690124, Test Loss: 4.192156732082367, LR: 4.4999999999999996e-05, Elapsed Time: 1551.88 seconds Step 67400/150000, Loss: 4.180992479324341, Test Loss: 4.192418336868286, LR: 4.4999999999999996e-05, Elapsed Time: 1554.14 seconds Step 67500/150000, Loss: 4.16102970123291, Test Loss: 4.191923558712006, LR: 4.4999999999999996e-05, Elapsed Time: 1556.40 seconds Step 67600/150000, Loss: 4.168528938293457, Test Loss: 4.1924185156822205, LR: 4.4999999999999996e-05, Elapsed Time: 1558.64 seconds Step 67700/150000, Loss: 4.18139458656311, Test Loss: 4.191772639751434, LR: 4.4999999999999996e-05, Elapsed Time: 1560.89 seconds Step 67800/150000, Loss: 4.1733353281021115, Test Loss: 4.190920054912567, LR: 4.4999999999999996e-05, Elapsed Time: 1563.15 seconds Step 67900/150000, Loss: 4.170492596626282, Test Loss: 4.190254330635071, LR: 4.4999999999999996e-05, Elapsed Time: 1565.40 seconds Step 68000/150000, Loss: 4.173598787784576, Test Loss: 4.191294074058533, LR: 4.4999999999999996e-05, Elapsed Time: 1567.65 seconds Step 68100/150000, Loss: 4.176061182022095, Test Loss: 4.191223740577698, LR: 4.4999999999999996e-05, Elapsed Time: 1569.91 seconds Step 68200/150000, Loss: 4.165406932830811, Test Loss: 4.1910040974617, LR: 4.4999999999999996e-05, Elapsed Time: 1572.16 seconds Step 68300/150000, Loss: 4.165218558311462, Test Loss: 4.191352427005768, LR: 4.4999999999999996e-05, Elapsed Time: 1574.41 seconds Step 68400/150000, Loss: 4.169144473075867, Test Loss: 4.191169917583466, LR: 4.4999999999999996e-05, Elapsed Time: 1576.67 seconds Step 68500/150000, Loss: 4.163315939903259, Test Loss: 4.190852105617523, LR: 4.4999999999999996e-05, Elapsed Time: 1578.93 seconds Step 68600/150000, Loss: 4.164966177940369, Test Loss: 4.190586149692535, LR: 4.4999999999999996e-05, Elapsed Time: 1581.19 seconds Step 68700/150000, Loss: 4.164487910270691, Test Loss: 4.190276265144348, LR: 4.4999999999999996e-05, Elapsed Time: 1583.44 seconds Step 68800/150000, Loss: 4.175600085258484, Test Loss: 4.190540432929993, LR: 4.4999999999999996e-05, Elapsed Time: 1585.70 seconds Step 68900/150000, Loss: 4.1664080142974855, Test Loss: 4.191003978252411, LR: 4.4999999999999996e-05, Elapsed Time: 1587.96 seconds Step 69000/150000, Loss: 4.172815842628479, Test Loss: 4.190832018852234, LR: 4.4999999999999996e-05, Elapsed Time: 1590.21 seconds Step 69100/150000, Loss: 4.17620402097702, Test Loss: 4.189656436443329, LR: 1.3499999999999998e-05, Elapsed Time: 1592.48 seconds Step 69200/150000, Loss: 4.175076713562012, Test Loss: 4.189188003540039, LR: 1.3499999999999998e-05, Elapsed Time: 1594.74 seconds Step 69300/150000, Loss: 4.1632839918136595, Test Loss: 4.1887317299842834, LR: 1.3499999999999998e-05, Elapsed Time: 1597.00 seconds Step 69400/150000, Loss: 4.166015331745148, Test Loss: 4.188775420188904, LR: 1.3499999999999998e-05, Elapsed Time: 1599.26 seconds Step 69500/150000, Loss: 4.176146988868713, Test Loss: 4.188641548156738, LR: 1.3499999999999998e-05, Elapsed Time: 1601.51 seconds Step 69600/150000, Loss: 4.165975031852722, Test Loss: 4.188302516937256, LR: 1.3499999999999998e-05, Elapsed Time: 1603.77 seconds Step 69700/150000, Loss: 4.170336475372315, Test Loss: 4.188397824764252, LR: 1.3499999999999998e-05, Elapsed Time: 1606.02 seconds Step 69800/150000, Loss: 4.159849877357483, Test Loss: 4.1881818771362305, LR: 1.3499999999999998e-05, Elapsed Time: 1608.27 seconds Step 69900/150000, Loss: 4.1636023783683775, Test Loss: 4.187955558300018, LR: 1.3499999999999998e-05, Elapsed 
Time: 1610.54 seconds Step 70000/150000, Loss: 4.17270215511322, Test Loss: 4.187670826911926, LR: 1.3499999999999998e-05, Elapsed Time: 1612.81 seconds Step 70100/150000, Loss: 4.153892529010773, Test Loss: 4.187719225883484, LR: 1.3499999999999998e-05, Elapsed Time: 1615.06 seconds Step 70200/150000, Loss: 4.166352891921997, Test Loss: 4.188002347946167, LR: 1.3499999999999998e-05, Elapsed Time: 1617.31 seconds Step 70300/150000, Loss: 4.167788014411927, Test Loss: 4.187557637691498, LR: 1.3499999999999998e-05, Elapsed Time: 1619.57 seconds Step 70400/150000, Loss: 4.163088550567627, Test Loss: 4.187866806983948, LR: 1.3499999999999998e-05, Elapsed Time: 1621.83 seconds Step 70500/150000, Loss: 4.170417153835297, Test Loss: 4.187388598918915, LR: 1.3499999999999998e-05, Elapsed Time: 1624.07 seconds Step 70600/150000, Loss: 4.163733742237091, Test Loss: 4.187606155872345, LR: 1.3499999999999998e-05, Elapsed Time: 1626.33 seconds Step 70700/150000, Loss: 4.154749577045441, Test Loss: 4.187497496604919, LR: 1.3499999999999998e-05, Elapsed Time: 1628.58 seconds Step 70800/150000, Loss: 4.168543229103088, Test Loss: 4.187367916107178, LR: 1.3499999999999998e-05, Elapsed Time: 1630.85 seconds Step 70900/150000, Loss: 4.151924521923065, Test Loss: 4.187236428260803, LR: 1.3499999999999998e-05, Elapsed Time: 1633.10 seconds Step 71000/150000, Loss: 4.159869809150695, Test Loss: 4.187285423278809, LR: 1.3499999999999998e-05, Elapsed Time: 1635.36 seconds Step 71100/150000, Loss: 4.163783664703369, Test Loss: 4.187179327011108, LR: 1.3499999999999998e-05, Elapsed Time: 1637.62 seconds Step 71200/150000, Loss: 4.166450257301331, Test Loss: 4.187014579772949, LR: 1.3499999999999998e-05, Elapsed Time: 1639.88 seconds Step 71300/150000, Loss: 4.154879701137543, Test Loss: 4.187195658683777, LR: 1.3499999999999998e-05, Elapsed Time: 1642.14 seconds Step 71400/150000, Loss: 4.154628357887268, Test Loss: 4.187214434146881, LR: 1.3499999999999998e-05, Elapsed Time: 1644.40 seconds Step 71500/150000, Loss: 4.156673998832702, Test Loss: 4.187053978443146, LR: 1.3499999999999998e-05, Elapsed Time: 1646.65 seconds Step 71600/150000, Loss: 4.170062417984009, Test Loss: 4.18699049949646, LR: 1.3499999999999998e-05, Elapsed Time: 1648.91 seconds Step 71700/150000, Loss: 4.157615098953247, Test Loss: 4.187010586261749, LR: 1.3499999999999998e-05, Elapsed Time: 1651.17 seconds Step 71800/150000, Loss: 4.1667045307159425, Test Loss: 4.186932444572449, LR: 1.3499999999999998e-05, Elapsed Time: 1653.43 seconds Step 71900/150000, Loss: 4.160829153060913, Test Loss: 4.1869842410087585, LR: 1.3499999999999998e-05, Elapsed Time: 1655.68 seconds Step 72000/150000, Loss: 4.1560625028610225, Test Loss: 4.1870726346969604, LR: 1.3499999999999998e-05, Elapsed Time: 1657.94 seconds Step 72100/150000, Loss: 4.155453739166259, Test Loss: 4.186811983585358, LR: 5e-06, Elapsed Time: 1660.20 seconds Step 72200/150000, Loss: 4.163777863979339, Test Loss: 4.186845123767853, LR: 5e-06, Elapsed Time: 1662.46 seconds Step 72300/150000, Loss: 4.163998665809632, Test Loss: 4.186637043952942, LR: 5e-06, Elapsed Time: 1664.72 seconds Step 72400/150000, Loss: 4.152436180114746, Test Loss: 4.186588704586029, LR: 5e-06, Elapsed Time: 1666.99 seconds Step 72500/150000, Loss: 4.145992007255554, Test Loss: 4.186447501182556, LR: 5e-06, Elapsed Time: 1669.25 seconds Step 72600/150000, Loss: 4.156301505565644, Test Loss: 4.1863210797309875, LR: 5e-06, Elapsed Time: 1671.51 seconds Step 72700/150000, Loss: 4.146506154537201, Test Loss: 
4.186343193054199, LR: 5e-06, Elapsed Time: 1673.78 seconds Step 72800/150000, Loss: 4.161303169727326, Test Loss: 4.1862775683403015, LR: 5e-06, Elapsed Time: 1676.05 seconds Step 72900/150000, Loss: 4.147039752006531, Test Loss: 4.186254918575287, LR: 5e-06, Elapsed Time: 1678.30 seconds Step 73000/150000, Loss: 4.1482584238052365, Test Loss: 4.186004459857941, LR: 5e-06, Elapsed Time: 1680.56 seconds Step 73100/150000, Loss: 4.156509981155396, Test Loss: 4.18611466884613, LR: 5e-06, Elapsed Time: 1682.83 seconds Step 73200/150000, Loss: 4.149578437805176, Test Loss: 4.186041355133057, LR: 5e-06, Elapsed Time: 1685.08 seconds Step 73300/150000, Loss: 4.154433546066284, Test Loss: 4.185933411121368, LR: 5e-06, Elapsed Time: 1687.34 seconds Step 73400/150000, Loss: 4.151546678543091, Test Loss: 4.186014235019684, LR: 5e-06, Elapsed Time: 1689.60 seconds Step 73500/150000, Loss: 4.138417809009552, Test Loss: 4.185975909233093, LR: 5e-06, Elapsed Time: 1691.87 seconds Step 73600/150000, Loss: 4.14903832912445, Test Loss: 4.185869216918945, LR: 5e-06, Elapsed Time: 1694.13 seconds Step 73700/150000, Loss: 4.14946346282959, Test Loss: 4.185847580432892, LR: 5e-06, Elapsed Time: 1696.38 seconds Step 73800/150000, Loss: 4.150955853462219, Test Loss: 4.185855984687805, LR: 5e-06, Elapsed Time: 1698.63 seconds Step 73900/150000, Loss: 4.151494860649109, Test Loss: 4.185780227184296, LR: 5e-06, Elapsed Time: 1700.89 seconds Step 74000/150000, Loss: 4.155793912410736, Test Loss: 4.18576967716217, LR: 5e-06, Elapsed Time: 1703.13 seconds Step 74100/150000, Loss: 4.149139895439148, Test Loss: 4.185834169387817, LR: 5e-06, Elapsed Time: 1705.40 seconds Step 74200/150000, Loss: 4.147499079704285, Test Loss: 4.185734748840332, LR: 5e-06, Elapsed Time: 1707.67 seconds Step 74300/150000, Loss: 4.1459405851364135, Test Loss: 4.18561190366745, LR: 5e-06, Elapsed Time: 1709.94 seconds Step 74400/150000, Loss: 4.140422947406769, Test Loss: 4.185671329498291, LR: 5e-06, Elapsed Time: 1712.19 seconds Step 74500/150000, Loss: 4.151492872238159, Test Loss: 4.185814380645752, LR: 5e-06, Elapsed Time: 1714.45 seconds Step 74600/150000, Loss: 4.152321305274963, Test Loss: 4.185787260532379, LR: 5e-06, Elapsed Time: 1716.71 seconds Step 74700/150000, Loss: 4.1408957719802855, Test Loss: 4.1856489181518555, LR: 5e-06, Elapsed Time: 1718.96 seconds Step 74800/150000, Loss: 4.148496265411377, Test Loss: 4.185710787773132, LR: 5e-06, Elapsed Time: 1721.21 seconds Step 74900/150000, Loss: 4.156952781677246, Test Loss: 4.185557126998901, LR: 5e-06, Elapsed Time: 1723.46 seconds Step 75000/150000, Loss: 4.13872855424881, Test Loss: 4.185812056064606, LR: 5e-06, Elapsed Time: 1725.71 seconds Step 75100/150000, Loss: 4.13018795967102, Test Loss: 4.1857094168663025, LR: 5e-06, Elapsed Time: 1727.97 seconds Step 75200/150000, Loss: 4.148472566604614, Test Loss: 4.185604155063629, LR: 5e-06, Elapsed Time: 1730.21 seconds Step 75300/150000, Loss: 4.145106382369995, Test Loss: 4.1856414675712585, LR: 5e-06, Elapsed Time: 1732.46 seconds Step 75400/150000, Loss: 4.140119795799255, Test Loss: 4.185621082782745, LR: 5e-06, Elapsed Time: 1734.70 seconds Step 75500/150000, Loss: 4.141389188766479, Test Loss: 4.185708940029144, LR: 5e-06, Elapsed Time: 1736.94 seconds Step 75600/150000, Loss: 4.130235946178436, Test Loss: 4.185527324676514, LR: 5e-06, Elapsed Time: 1739.20 seconds Step 75700/150000, Loss: 4.145750904083252, Test Loss: 4.185644090175629, LR: 5e-06, Elapsed Time: 1741.47 seconds Step 75800/150000, Loss: 4.143891983032226, 
Test Loss: 4.185523331165314, LR: 5e-06, Elapsed Time: 1743.73 seconds Step 75900/150000, Loss: 4.148354635238648, Test Loss: 4.185573399066925, LR: 5e-06, Elapsed Time: 1745.99 seconds Step 76000/150000, Loss: 4.142433052062988, Test Loss: 4.18557071685791, LR: 5e-06, Elapsed Time: 1748.25 seconds Step 76100/150000, Loss: 4.14603440284729, Test Loss: 4.185494124889374, LR: 5e-06, Elapsed Time: 1750.51 seconds Step 76200/150000, Loss: 4.144834017753601, Test Loss: 4.185433208942413, LR: 5e-06, Elapsed Time: 1752.76 seconds Step 76300/150000, Loss: 4.137983393669129, Test Loss: 4.185459196567535, LR: 5e-06, Elapsed Time: 1755.01 seconds Step 76400/150000, Loss: 4.146612558364868, Test Loss: 4.185526669025421, LR: 5e-06, Elapsed Time: 1757.26 seconds Step 76500/150000, Loss: 4.143929324150085, Test Loss: 4.185377299785614, LR: 5e-06, Elapsed Time: 1759.52 seconds Step 76600/150000, Loss: 4.150637731552124, Test Loss: 4.185503304004669, LR: 5e-06, Elapsed Time: 1761.78 seconds Step 76700/150000, Loss: 4.126633954048157, Test Loss: 4.1854047775268555, LR: 5e-06, Elapsed Time: 1764.03 seconds Step 76800/150000, Loss: 4.135329990386963, Test Loss: 4.185335278511047, LR: 5e-06, Elapsed Time: 1766.29 seconds Step 76900/150000, Loss: 4.139923787117004, Test Loss: 4.185485482215881, LR: 5e-06, Elapsed Time: 1768.55 seconds Step 77000/150000, Loss: 4.12781970500946, Test Loss: 4.185505151748657, LR: 5e-06, Elapsed Time: 1770.80 seconds Step 77100/150000, Loss: 4.136189231872558, Test Loss: 4.185498297214508, LR: 5e-06, Elapsed Time: 1773.06 seconds Step 77200/150000, Loss: 4.139810266494751, Test Loss: 4.1855310797691345, LR: 5e-06, Elapsed Time: 1775.32 seconds Step 77300/150000, Loss: 4.139953150749206, Test Loss: 4.185369312763214, LR: 5e-06, Elapsed Time: 1777.60 seconds Step 77400/150000, Loss: 4.126514599323273, Test Loss: 4.185569941997528, LR: 5e-06, Elapsed Time: 1779.85 seconds Step 77500/150000, Loss: 4.128481736183167, Test Loss: 4.18559056520462, LR: 5e-06, Elapsed Time: 1782.12 seconds Step 77600/150000, Loss: 4.1440234065055845, Test Loss: 4.185402929782867, LR: 5e-06, Elapsed Time: 1784.37 seconds Step 77700/150000, Loss: 4.17192126750946, Test Loss: 4.185186088085175, LR: 5e-06, Elapsed Time: 1786.62 seconds Step 77800/150000, Loss: 4.177226414680481, Test Loss: 4.1850749254226685, LR: 5e-06, Elapsed Time: 1788.88 seconds Step 77900/150000, Loss: 4.159675400257111, Test Loss: 4.185079574584961, LR: 5e-06, Elapsed Time: 1791.15 seconds Step 78000/150000, Loss: 4.166853523254394, Test Loss: 4.185042679309845, LR: 5e-06, Elapsed Time: 1793.40 seconds Step 78100/150000, Loss: 4.158627812862396, Test Loss: 4.185085773468018, LR: 5e-06, Elapsed Time: 1795.67 seconds Step 78200/150000, Loss: 4.165379462242126, Test Loss: 4.1849653124809265, LR: 5e-06, Elapsed Time: 1797.91 seconds Step 78300/150000, Loss: 4.169523615837097, Test Loss: 4.1850961446762085, LR: 5e-06, Elapsed Time: 1800.18 seconds Step 78400/150000, Loss: 4.15878701210022, Test Loss: 4.185166776180267, LR: 5e-06, Elapsed Time: 1802.45 seconds Step 78500/150000, Loss: 4.164877481460572, Test Loss: 4.1851197481155396, LR: 5e-06, Elapsed Time: 1804.71 seconds Step 78600/150000, Loss: 4.1620083427429195, Test Loss: 4.185107350349426, LR: 5e-06, Elapsed Time: 1806.97 seconds Step 78700/150000, Loss: 4.158428740501404, Test Loss: 4.185027480125427, LR: 5e-06, Elapsed Time: 1809.23 seconds Step 78800/150000, Loss: 4.15625657081604, Test Loss: 4.185107469558716, LR: 5e-06, Elapsed Time: 1811.50 seconds Step 78900/150000, Loss: 
4.161324915885925, Test Loss: 4.185108244419098, LR: 5e-06, Elapsed Time: 1813.76 seconds Step 79000/150000, Loss: 4.17138774394989, Test Loss: 4.185007154941559, LR: 5e-06, Elapsed Time: 1816.03 seconds Step 79100/150000, Loss: 4.1593982076644895, Test Loss: 4.1851208209991455, LR: 5e-06, Elapsed Time: 1818.29 seconds Step 79200/150000, Loss: 4.152566609382629, Test Loss: 4.184969425201416, LR: 5e-06, Elapsed Time: 1820.55 seconds Step 79300/150000, Loss: 4.15707941532135, Test Loss: 4.1849387884140015, LR: 5e-06, Elapsed Time: 1822.83 seconds Step 79400/150000, Loss: 4.157061395645141, Test Loss: 4.184900522232056, LR: 5e-06, Elapsed Time: 1825.07 seconds Step 79500/150000, Loss: 4.159421668052674, Test Loss: 4.184811770915985, LR: 5e-06, Elapsed Time: 1827.33 seconds Step 79600/150000, Loss: 4.164738459587097, Test Loss: 4.1846354603767395, LR: 5e-06, Elapsed Time: 1829.59 seconds Step 79700/150000, Loss: 4.152469029426575, Test Loss: 4.1847057938575745, LR: 5e-06, Elapsed Time: 1831.86 seconds Step 79800/150000, Loss: 4.165579719543457, Test Loss: 4.184767484664917, LR: 5e-06, Elapsed Time: 1834.12 seconds Step 79900/150000, Loss: 4.158928921222687, Test Loss: 4.184775948524475, LR: 5e-06, Elapsed Time: 1836.38 seconds Step 80000/150000, Loss: 4.1548524618148805, Test Loss: 4.184822261333466, LR: 5e-06, Elapsed Time: 1838.63 seconds Step 80100/150000, Loss: 4.164975750446319, Test Loss: 4.184824824333191, LR: 5e-06, Elapsed Time: 1840.89 seconds Step 80200/150000, Loss: 4.155625972747803, Test Loss: 4.184835255146027, LR: 5e-06, Elapsed Time: 1843.14 seconds Step 80300/150000, Loss: 4.142808480262756, Test Loss: 4.1847898960113525, LR: 5e-06, Elapsed Time: 1845.39 seconds Step 80400/150000, Loss: 4.154113845825195, Test Loss: 4.1848562359809875, LR: 5e-06, Elapsed Time: 1847.65 seconds Step 80500/150000, Loss: 4.158547351360321, Test Loss: 4.184978127479553, LR: 5e-06, Elapsed Time: 1849.91 seconds Step 80600/150000, Loss: 4.154414210319519, Test Loss: 4.184956610202789, LR: 5e-06, Elapsed Time: 1852.17 seconds Step 80700/150000, Loss: 4.1591877746582036, Test Loss: 4.184852719306946, LR: 5e-06, Elapsed Time: 1854.42 seconds Step 80800/150000, Loss: 4.149473810195923, Test Loss: 4.184854865074158, LR: 5e-06, Elapsed Time: 1856.68 seconds Step 80900/150000, Loss: 4.149453635215759, Test Loss: 4.1848613023757935, LR: 5e-06, Elapsed Time: 1858.94 seconds Step 81000/150000, Loss: 4.148298058509827, Test Loss: 4.184964954853058, LR: 5e-06, Elapsed Time: 1861.21 seconds Step 81100/150000, Loss: 4.1712527132034305, Test Loss: 4.1847540736198425, LR: 5e-06, Elapsed Time: 1863.47 seconds Step 81200/150000, Loss: 4.144808578491211, Test Loss: 4.184790372848511, LR: 5e-06, Elapsed Time: 1865.73 seconds Step 81300/150000, Loss: 4.157258317470551, Test Loss: 4.18467253446579, LR: 5e-06, Elapsed Time: 1868.00 seconds Step 81400/150000, Loss: 4.1620003652572635, Test Loss: 4.184818148612976, LR: 5e-06, Elapsed Time: 1870.26 seconds Step 81500/150000, Loss: 4.143972134590149, Test Loss: 4.184668600559235, LR: 5e-06, Elapsed Time: 1872.53 seconds Step 81600/150000, Loss: 4.146478779315949, Test Loss: 4.1846619844436646, LR: 5e-06, Elapsed Time: 1874.79 seconds Step 81700/150000, Loss: 4.1465218019485475, Test Loss: 4.184607565402985, LR: 5e-06, Elapsed Time: 1877.06 seconds Step 81800/150000, Loss: 4.162878937721253, Test Loss: 4.184605896472931, LR: 5e-06, Elapsed Time: 1879.33 seconds Step 81900/150000, Loss: 4.150522487163544, Test Loss: 4.184553503990173, LR: 5e-06, Elapsed Time: 1881.60 seconds 
Step 82000/150000, Loss: 4.1588552379608155, Test Loss: 4.184423923492432, LR: 5e-06, Elapsed Time: 1883.85 seconds Step 82100/150000, Loss: 4.152663180828094, Test Loss: 4.1844266057014465, LR: 5e-06, Elapsed Time: 1886.12 seconds Step 82200/150000, Loss: 4.149004077911377, Test Loss: 4.184476971626282, LR: 5e-06, Elapsed Time: 1888.38 seconds Step 82300/150000, Loss: 4.158762063980102, Test Loss: 4.184332311153412, LR: 5e-06, Elapsed Time: 1890.64 seconds Step 82400/150000, Loss: 4.157428297996521, Test Loss: 4.184389889240265, LR: 5e-06, Elapsed Time: 1892.89 seconds Step 82500/150000, Loss: 4.149614143371582, Test Loss: 4.184585332870483, LR: 5e-06, Elapsed Time: 1895.15 seconds Step 82600/150000, Loss: 4.148299100399018, Test Loss: 4.184548437595367, LR: 5e-06, Elapsed Time: 1897.41 seconds Step 82700/150000, Loss: 4.144604506492615, Test Loss: 4.184411883354187, LR: 5e-06, Elapsed Time: 1899.68 seconds Step 82800/150000, Loss: 4.15677001953125, Test Loss: 4.1845786571502686, LR: 5e-06, Elapsed Time: 1901.96 seconds Step 82900/150000, Loss: 4.152357552051544, Test Loss: 4.1844412088394165, LR: 5e-06, Elapsed Time: 1904.22 seconds Step 83000/150000, Loss: 4.163792767524719, Test Loss: 4.18435275554657, LR: 5e-06, Elapsed Time: 1906.48 seconds Step 83100/150000, Loss: 4.129580693244934, Test Loss: 4.184370577335358, LR: 5e-06, Elapsed Time: 1908.73 seconds Step 83200/150000, Loss: 4.138106846809388, Test Loss: 4.184455990791321, LR: 5e-06, Elapsed Time: 1910.99 seconds Step 83300/150000, Loss: 4.148853743076325, Test Loss: 4.184538960456848, LR: 5e-06, Elapsed Time: 1913.25 seconds Step 83400/150000, Loss: 4.148370714187622, Test Loss: 4.1843520402908325, LR: 5e-06, Elapsed Time: 1915.50 seconds Step 83500/150000, Loss: 4.1542469501495365, Test Loss: 4.184362471103668, LR: 5e-06, Elapsed Time: 1917.76 seconds Step 83600/150000, Loss: 4.147538938522339, Test Loss: 4.184317409992218, LR: 5e-06, Elapsed Time: 1920.00 seconds Step 83700/150000, Loss: 4.154468150138855, Test Loss: 4.184334456920624, LR: 5e-06, Elapsed Time: 1922.26 seconds Step 83800/150000, Loss: 4.1390941715240475, Test Loss: 4.184287846088409, LR: 5e-06, Elapsed Time: 1924.51 seconds Step 83900/150000, Loss: 4.14953473329544, Test Loss: 4.184296190738678, LR: 5e-06, Elapsed Time: 1926.78 seconds Step 84000/150000, Loss: 4.134500205516815, Test Loss: 4.184376776218414, LR: 5e-06, Elapsed Time: 1929.04 seconds Step 84100/150000, Loss: 4.148134100437164, Test Loss: 4.184421479701996, LR: 5e-06, Elapsed Time: 1931.30 seconds Step 84200/150000, Loss: 4.155234177112579, Test Loss: 4.184223532676697, LR: 5e-06, Elapsed Time: 1933.56 seconds Step 84300/150000, Loss: 4.150139055252075, Test Loss: 4.184188902378082, LR: 5e-06, Elapsed Time: 1935.82 seconds Step 84400/150000, Loss: 4.147102508544922, Test Loss: 4.184170186519623, LR: 5e-06, Elapsed Time: 1938.07 seconds Step 84500/150000, Loss: 4.150843300819397, Test Loss: 4.1841921210289, LR: 5e-06, Elapsed Time: 1940.34 seconds Step 84600/150000, Loss: 4.136047978401184, Test Loss: 4.184291422367096, LR: 5e-06, Elapsed Time: 1942.59 seconds Step 84700/150000, Loss: 4.146226062774658, Test Loss: 4.184329807758331, LR: 5e-06, Elapsed Time: 1944.85 seconds Step 84800/150000, Loss: 4.1453132462501525, Test Loss: 4.1841800808906555, LR: 5e-06, Elapsed Time: 1947.11 seconds Step 84900/150000, Loss: 4.1269403743743895, Test Loss: 4.184243679046631, LR: 5e-06, Elapsed Time: 1949.37 seconds Step 85000/150000, Loss: 4.1517276263237, Test Loss: 4.184073567390442, LR: 5e-06, Elapsed Time: 
1951.62 seconds Step 85100/150000, Loss: 4.132954428195953, Test Loss: 4.184085786342621, LR: 5e-06, Elapsed Time: 1953.88 seconds Step 85200/150000, Loss: 4.142291564941406, Test Loss: 4.184191584587097, LR: 5e-06, Elapsed Time: 1956.14 seconds Step 85300/150000, Loss: 4.135158443450928, Test Loss: 4.184432625770569, LR: 5e-06, Elapsed Time: 1958.39 seconds Step 85400/150000, Loss: 4.148129022121429, Test Loss: 4.18435400724411, LR: 5e-06, Elapsed Time: 1960.66 seconds Step 85500/150000, Loss: 4.162870740890503, Test Loss: 4.184195160865784, LR: 5e-06, Elapsed Time: 1962.92 seconds Step 85600/150000, Loss: 4.168416609764099, Test Loss: 4.184270918369293, LR: 5e-06, Elapsed Time: 1965.17 seconds Step 85700/150000, Loss: 4.1615572762489315, Test Loss: 4.184202194213867, LR: 5e-06, Elapsed Time: 1967.43 seconds Step 85800/150000, Loss: 4.161231870651245, Test Loss: 4.184149622917175, LR: 5e-06, Elapsed Time: 1969.69 seconds Step 85900/150000, Loss: 4.149889643192291, Test Loss: 4.184277296066284, LR: 5e-06, Elapsed Time: 1971.95 seconds Step 86000/150000, Loss: 4.163598098754883, Test Loss: 4.1842080950737, LR: 5e-06, Elapsed Time: 1974.21 seconds Step 86100/150000, Loss: 4.158399243354797, Test Loss: 4.184106409549713, LR: 5e-06, Elapsed Time: 1976.46 seconds Step 86200/150000, Loss: 4.1555857706069945, Test Loss: 4.1841961145401, LR: 5e-06, Elapsed Time: 1978.71 seconds Step 86300/150000, Loss: 4.165143394470215, Test Loss: 4.184028744697571, LR: 5e-06, Elapsed Time: 1980.97 seconds Step 86400/150000, Loss: 4.153741688728332, Test Loss: 4.184206783771515, LR: 5e-06, Elapsed Time: 1983.23 seconds Step 86500/150000, Loss: 4.149925351142883, Test Loss: 4.184090852737427, LR: 5e-06, Elapsed Time: 1985.49 seconds Step 86600/150000, Loss: 4.172379155158996, Test Loss: 4.183995306491852, LR: 5e-06, Elapsed Time: 1987.74 seconds Step 86700/150000, Loss: 4.160453786849976, Test Loss: 4.184096992015839, LR: 5e-06, Elapsed Time: 1989.98 seconds Step 86800/150000, Loss: 4.165561594963074, Test Loss: 4.18401825428009, LR: 5e-06, Elapsed Time: 1992.24 seconds Step 86900/150000, Loss: 4.171698975563049, Test Loss: 4.18390291929245, LR: 5e-06, Elapsed Time: 1994.49 seconds Step 87000/150000, Loss: 4.157347211837768, Test Loss: 4.184033155441284, LR: 5e-06, Elapsed Time: 1996.75 seconds Step 87100/150000, Loss: 4.176164464950562, Test Loss: 4.183975279331207, LR: 5e-06, Elapsed Time: 1999.00 seconds Step 87200/150000, Loss: 4.156792795658111, Test Loss: 4.183990776538849, LR: 5e-06, Elapsed Time: 2001.25 seconds Step 87300/150000, Loss: 4.163677983283996, Test Loss: 4.184051275253296, LR: 5e-06, Elapsed Time: 2003.50 seconds Step 87400/150000, Loss: 4.142675805091858, Test Loss: 4.184219241142273, LR: 5e-06, Elapsed Time: 2005.76 seconds Step 87500/150000, Loss: 4.171035342216491, Test Loss: 4.1840837597846985, LR: 5e-06, Elapsed Time: 2008.02 seconds Step 87600/150000, Loss: 4.149381923675537, Test Loss: 4.184144616127014, LR: 5e-06, Elapsed Time: 2010.28 seconds Step 87700/150000, Loss: 4.1552663946151736, Test Loss: 4.18398654460907, LR: 5e-06, Elapsed Time: 2012.53 seconds Step 87800/150000, Loss: 4.161367602348328, Test Loss: 4.183851420879364, LR: 5e-06, Elapsed Time: 2014.79 seconds Step 87900/150000, Loss: 4.151983733177185, Test Loss: 4.183811783790588, LR: 5e-06, Elapsed Time: 2017.03 seconds Step 88000/150000, Loss: 4.16240348815918, Test Loss: 4.183785676956177, LR: 5e-06, Elapsed Time: 2019.28 seconds Step 88100/150000, Loss: 4.153958985805511, Test Loss: 4.183819830417633, LR: 5e-06, Elapsed 
Time: 2021.55 seconds Step 88200/150000, Loss: 4.164287581443786, Test Loss: 4.183813691139221, LR: 5e-06, Elapsed Time: 2023.80 seconds Step 88300/150000, Loss: 4.157302579879761, Test Loss: 4.18381541967392, LR: 5e-06, Elapsed Time: 2026.06 seconds Step 88400/150000, Loss: 4.150865890979767, Test Loss: 4.183937609195709, LR: 5e-06, Elapsed Time: 2028.32 seconds Step 88500/150000, Loss: 4.139501476287842, Test Loss: 4.183914661407471, LR: 5e-06, Elapsed Time: 2030.58 seconds Step 88600/150000, Loss: 4.152716546058655, Test Loss: 4.183979272842407, LR: 5e-06, Elapsed Time: 2032.84 seconds Step 88700/150000, Loss: 4.150099024772644, Test Loss: 4.183803200721741, LR: 5e-06, Elapsed Time: 2035.11 seconds Step 88800/150000, Loss: 4.145555081367493, Test Loss: 4.183856308460236, LR: 5e-06, Elapsed Time: 2037.37 seconds Step 88900/150000, Loss: 4.148350715637207, Test Loss: 4.183845579624176, LR: 5e-06, Elapsed Time: 2039.62 seconds Step 89000/150000, Loss: 4.168212909698486, Test Loss: 4.183781623840332, LR: 5e-06, Elapsed Time: 2041.87 seconds Step 89100/150000, Loss: 4.150783843994141, Test Loss: 4.183813631534576, LR: 5e-06, Elapsed Time: 2044.13 seconds Step 89200/150000, Loss: 4.152151687145233, Test Loss: 4.183742582798004, LR: 5e-06, Elapsed Time: 2046.38 seconds Step 89300/150000, Loss: 4.159282503128051, Test Loss: 4.183992564678192, LR: 5e-06, Elapsed Time: 2048.63 seconds Step 89400/150000, Loss: 4.1408341693878175, Test Loss: 4.1839078068733215, LR: 5e-06, Elapsed Time: 2050.90 seconds Step 89500/150000, Loss: 4.149708156585693, Test Loss: 4.183818161487579, LR: 5e-06, Elapsed Time: 2053.15 seconds Step 89600/150000, Loss: 4.164428277015686, Test Loss: 4.183773219585419, LR: 5e-06, Elapsed Time: 2055.41 seconds Step 89700/150000, Loss: 4.146211037635803, Test Loss: 4.1836448311805725, LR: 5e-06, Elapsed Time: 2057.67 seconds Step 89800/150000, Loss: 4.154671576023102, Test Loss: 4.183655142784119, LR: 5e-06, Elapsed Time: 2059.93 seconds Step 89900/150000, Loss: 4.152160489559174, Test Loss: 4.183666348457336, LR: 5e-06, Elapsed Time: 2062.19 seconds Step 90000/150000, Loss: 4.15120795249939, Test Loss: 4.1836735010147095, LR: 5e-06, Elapsed Time: 2064.45 seconds Step 90100/150000, Loss: 4.146942694187164, Test Loss: 4.183748006820679, LR: 5e-06, Elapsed Time: 2066.72 seconds Step 90200/150000, Loss: 4.1466611385345455, Test Loss: 4.183781325817108, LR: 5e-06, Elapsed Time: 2068.97 seconds Step 90300/150000, Loss: 4.144896731376648, Test Loss: 4.183779418468475, LR: 5e-06, Elapsed Time: 2071.22 seconds Step 90400/150000, Loss: 4.1472083902359005, Test Loss: 4.183737754821777, LR: 5e-06, Elapsed Time: 2073.48 seconds Step 90500/150000, Loss: 4.143563628196716, Test Loss: 4.1837891936302185, LR: 5e-06, Elapsed Time: 2075.73 seconds Step 90600/150000, Loss: 4.144126205444336, Test Loss: 4.183690130710602, LR: 5e-06, Elapsed Time: 2077.99 seconds Step 90700/150000, Loss: 4.152432131767273, Test Loss: 4.183613836765289, LR: 5e-06, Elapsed Time: 2080.25 seconds Step 90800/150000, Loss: 4.14737756729126, Test Loss: 4.183684706687927, LR: 5e-06, Elapsed Time: 2082.51 seconds Step 90900/150000, Loss: 4.1547225856781, Test Loss: 4.183634161949158, LR: 5e-06, Elapsed Time: 2084.76 seconds Step 91000/150000, Loss: 4.166209473609924, Test Loss: 4.183639109134674, LR: 5e-06, Elapsed Time: 2087.02 seconds Step 91100/150000, Loss: 4.166218295097351, Test Loss: 4.183577358722687, LR: 5e-06, Elapsed Time: 2089.28 seconds Step 91200/150000, Loss: 4.1572181057929996, Test Loss: 4.183479726314545, LR: 
5e-06, Elapsed Time: 2091.54 seconds Step 91300/150000, Loss: 4.156561076641083, Test Loss: 4.183564603328705, LR: 5e-06, Elapsed Time: 2093.79 seconds Step 91400/150000, Loss: 4.1613597536087035, Test Loss: 4.1836132407188416, LR: 5e-06, Elapsed Time: 2096.05 seconds Step 91500/150000, Loss: 4.164888398647308, Test Loss: 4.1835437417030334, LR: 5e-06, Elapsed Time: 2098.30 seconds Step 91600/150000, Loss: 4.148808798789978, Test Loss: 4.183633506298065, LR: 5e-06, Elapsed Time: 2100.56 seconds Step 91700/150000, Loss: 4.153900547027588, Test Loss: 4.183574736118317, LR: 5e-06, Elapsed Time: 2102.82 seconds Step 91800/150000, Loss: 4.1588416147232055, Test Loss: 4.1834452748298645, LR: 5e-06, Elapsed Time: 2105.08 seconds Step 91900/150000, Loss: 4.153540086746216, Test Loss: 4.183582305908203, LR: 5e-06, Elapsed Time: 2107.35 seconds Step 92000/150000, Loss: 4.151847333908081, Test Loss: 4.183577001094818, LR: 5e-06, Elapsed Time: 2109.61 seconds Step 92100/150000, Loss: 4.160052318572998, Test Loss: 4.183630883693695, LR: 5e-06, Elapsed Time: 2111.87 seconds Step 92200/150000, Loss: 4.159687283039093, Test Loss: 4.183535873889923, LR: 5e-06, Elapsed Time: 2114.13 seconds Step 92300/150000, Loss: 4.163098940849304, Test Loss: 4.183574140071869, LR: 5e-06, Elapsed Time: 2116.38 seconds Step 92400/150000, Loss: 4.163488910198212, Test Loss: 4.183368504047394, LR: 5e-06, Elapsed Time: 2118.64 seconds Step 92500/150000, Loss: 4.142394866943359, Test Loss: 4.183571219444275, LR: 5e-06, Elapsed Time: 2120.89 seconds Step 92600/150000, Loss: 4.146866979598999, Test Loss: 4.183548986911774, LR: 5e-06, Elapsed Time: 2123.15 seconds Step 92700/150000, Loss: 4.160304365158081, Test Loss: 4.18349426984787, LR: 5e-06, Elapsed Time: 2125.41 seconds Step 92800/150000, Loss: 4.1493293571472165, Test Loss: 4.183519005775452, LR: 5e-06, Elapsed Time: 2127.67 seconds Step 92900/150000, Loss: 4.155476098060608, Test Loss: 4.18343722820282, LR: 5e-06, Elapsed Time: 2129.92 seconds Step 93000/150000, Loss: 4.1512046146392825, Test Loss: 4.183416903018951, LR: 5e-06, Elapsed Time: 2132.18 seconds Step 93100/150000, Loss: 4.159032220840454, Test Loss: 4.183447957038879, LR: 5e-06, Elapsed Time: 2134.43 seconds Step 93200/150000, Loss: 4.1487909698486325, Test Loss: 4.183544456958771, LR: 5e-06, Elapsed Time: 2136.68 seconds Step 93300/150000, Loss: 4.1415836405754085, Test Loss: 4.183523654937744, LR: 5e-06, Elapsed Time: 2138.93 seconds Step 93400/150000, Loss: 4.153966732025147, Test Loss: 4.183479368686676, LR: 5e-06, Elapsed Time: 2141.18 seconds Step 93500/150000, Loss: 4.1623581027984615, Test Loss: 4.183531105518341, LR: 5e-06, Elapsed Time: 2143.44 seconds Step 93600/150000, Loss: 4.150136847496032, Test Loss: 4.183545231819153, LR: 5e-06, Elapsed Time: 2145.69 seconds Step 93700/150000, Loss: 4.15760573387146, Test Loss: 4.1834911704063416, LR: 5e-06, Elapsed Time: 2147.94 seconds Step 93800/150000, Loss: 4.156633849143982, Test Loss: 4.183627188205719, LR: 5e-06, Elapsed Time: 2150.19 seconds Step 93900/150000, Loss: 4.14601954460144, Test Loss: 4.18370133638382, LR: 5e-06, Elapsed Time: 2152.45 seconds Step 94000/150000, Loss: 4.1531533432006835, Test Loss: 4.183591663837433, LR: 5e-06, Elapsed Time: 2154.70 seconds Step 94100/150000, Loss: 4.159339241981506, Test Loss: 4.183757185935974, LR: 5e-06, Elapsed Time: 2156.96 seconds Step 94200/150000, Loss: 4.159772534370422, Test Loss: 4.183677673339844, LR: 5e-06, Elapsed Time: 2159.20 seconds Step 94300/150000, Loss: 4.141176972389221, Test Loss: 
4.183652222156525, LR: 5e-06, Elapsed Time: 2161.46 seconds Step 94400/150000, Loss: 4.147612719535828, Test Loss: 4.183518052101135, LR: 5e-06, Elapsed Time: 2163.71 seconds Step 94500/150000, Loss: 4.153586602210998, Test Loss: 4.183511316776276, LR: 5e-06, Elapsed Time: 2165.96 seconds Step 94600/150000, Loss: 4.143413186073303, Test Loss: 4.183470606803894, LR: 5e-06, Elapsed Time: 2168.21 seconds Step 94700/150000, Loss: 4.151809849739075, Test Loss: 4.183469653129578, LR: 5e-06, Elapsed Time: 2170.47 seconds Step 94800/150000, Loss: 4.144912786483765, Test Loss: 4.183407485485077, LR: 5e-06, Elapsed Time: 2172.72 seconds Step 94900/150000, Loss: 4.1524891185760495, Test Loss: 4.18336409330368, LR: 5e-06, Elapsed Time: 2174.98 seconds Step 95000/150000, Loss: 4.143587417602539, Test Loss: 4.183496832847595, LR: 5e-06, Elapsed Time: 2177.23 seconds Step 95100/150000, Loss: 4.148751626014709, Test Loss: 4.18340665102005, LR: 5e-06, Elapsed Time: 2179.49 seconds Step 95200/150000, Loss: 4.154482169151306, Test Loss: 4.1832475662231445, LR: 5e-06, Elapsed Time: 2181.75 seconds Step 95300/150000, Loss: 4.137493896484375, Test Loss: 4.183410286903381, LR: 5e-06, Elapsed Time: 2184.00 seconds Step 95400/150000, Loss: 4.147489614486695, Test Loss: 4.183423221111298, LR: 5e-06, Elapsed Time: 2186.27 seconds Step 95500/150000, Loss: 4.137350788116455, Test Loss: 4.183321833610535, LR: 5e-06, Elapsed Time: 2188.52 seconds Step 95600/150000, Loss: 4.148268222808838, Test Loss: 4.183333575725555, LR: 5e-06, Elapsed Time: 2190.78 seconds Step 95700/150000, Loss: 4.145018610954285, Test Loss: 4.183397889137268, LR: 5e-06, Elapsed Time: 2193.03 seconds Step 95800/150000, Loss: 4.1544324398040775, Test Loss: 4.1832996010780334, LR: 5e-06, Elapsed Time: 2195.29 seconds Step 95900/150000, Loss: 4.143790442943573, Test Loss: 4.183314323425293, LR: 5e-06, Elapsed Time: 2197.55 seconds Step 96000/150000, Loss: 4.150312061309815, Test Loss: 4.1833232045173645, LR: 5e-06, Elapsed Time: 2199.80 seconds Step 96100/150000, Loss: 4.144095764160157, Test Loss: 4.183301091194153, LR: 5e-06, Elapsed Time: 2202.05 seconds Step 96200/150000, Loss: 4.1410860824584965, Test Loss: 4.183230519294739, LR: 5e-06, Elapsed Time: 2204.30 seconds Step 96300/150000, Loss: 4.136436696052551, Test Loss: 4.183328568935394, LR: 5e-06, Elapsed Time: 2206.56 seconds Step 96400/150000, Loss: 4.14696117401123, Test Loss: 4.1834517121315, LR: 5e-06, Elapsed Time: 2208.81 seconds Step 96500/150000, Loss: 4.152370727062225, Test Loss: 4.183340787887573, LR: 5e-06, Elapsed Time: 2211.07 seconds Step 96600/150000, Loss: 4.138738629817962, Test Loss: 4.183277189731598, LR: 5e-06, Elapsed Time: 2213.33 seconds Step 96700/150000, Loss: 4.148123893737793, Test Loss: 4.183371961116791, LR: 5e-06, Elapsed Time: 2215.59 seconds Step 96800/150000, Loss: 4.149331035614014, Test Loss: 4.183218777179718, LR: 5e-06, Elapsed Time: 2217.85 seconds Step 96900/150000, Loss: 4.1331688594818115, Test Loss: 4.183494389057159, LR: 5e-06, Elapsed Time: 2220.12 seconds Step 97000/150000, Loss: 4.129537415504456, Test Loss: 4.18336820602417, LR: 5e-06, Elapsed Time: 2222.37 seconds Step 97100/150000, Loss: 4.145774412155151, Test Loss: 4.183375716209412, LR: 5e-06, Elapsed Time: 2224.63 seconds Step 97200/150000, Loss: 4.141331381797791, Test Loss: 4.1834118366241455, LR: 5e-06, Elapsed Time: 2226.89 seconds Step 97300/150000, Loss: 4.133864457607269, Test Loss: 4.183435499668121, LR: 5e-06, Elapsed Time: 2229.14 seconds Step 97400/150000, Loss: 
4.137330584526062, Test Loss: 4.183455765247345, LR: 5e-06, Elapsed Time: 2231.41 seconds Step 97500/150000, Loss: 4.133285634517669, Test Loss: 4.183301746845245, LR: 5e-06, Elapsed Time: 2233.66 seconds Step 97600/150000, Loss: 4.144105651378632, Test Loss: 4.183429062366486, LR: 5e-06, Elapsed Time: 2235.92 seconds Step 97700/150000, Loss: 4.145221314430237, Test Loss: 4.18340539932251, LR: 5e-06, Elapsed Time: 2238.18 seconds Step 97800/150000, Loss: 4.136519889831543, Test Loss: 4.183468163013458, LR: 5e-06, Elapsed Time: 2240.44 seconds Step 97900/150000, Loss: 4.14711178779602, Test Loss: 4.183388352394104, LR: 5e-06, Elapsed Time: 2242.71 seconds Step 98000/150000, Loss: 4.1363042449951175, Test Loss: 4.1832743883132935, LR: 5e-06, Elapsed Time: 2244.96 seconds Step 98100/150000, Loss: 4.14573924779892, Test Loss: 4.183207273483276, LR: 5e-06, Elapsed Time: 2247.22 seconds Step 98200/150000, Loss: 4.12712021112442, Test Loss: 4.183431684970856, LR: 5e-06, Elapsed Time: 2249.47 seconds Step 98300/150000, Loss: 4.148455491065979, Test Loss: 4.183330535888672, LR: 5e-06, Elapsed Time: 2251.73 seconds Step 98400/150000, Loss: 4.149047136306763, Test Loss: 4.18329918384552, LR: 5e-06, Elapsed Time: 2253.99 seconds Step 98500/150000, Loss: 4.14116295337677, Test Loss: 4.183337092399597, LR: 5e-06, Elapsed Time: 2256.25 seconds Step 98600/150000, Loss: 4.122706413269043, Test Loss: 4.183327257633209, LR: 5e-06, Elapsed Time: 2258.51 seconds Step 98700/150000, Loss: 4.1349200654029845, Test Loss: 4.183311998844147, LR: 5e-06, Elapsed Time: 2260.78 seconds Step 98800/150000, Loss: 4.13211941242218, Test Loss: 4.1833677887916565, LR: 5e-06, Elapsed Time: 2263.01 seconds Step 98900/150000, Loss: 4.130132937431336, Test Loss: 4.183439254760742, LR: 5e-06, Elapsed Time: 2265.26 seconds Step 99000/150000, Loss: 4.137482266426087, Test Loss: 4.183430910110474, LR: 5e-06, Elapsed Time: 2267.52 seconds Step 99100/150000, Loss: 4.135607285499573, Test Loss: 4.183497905731201, LR: 5e-06, Elapsed Time: 2269.78 seconds Step 99200/150000, Loss: 4.1319965386390685, Test Loss: 4.183333694934845, LR: 5e-06, Elapsed Time: 2272.03 seconds Step 99300/150000, Loss: 4.131804468631745, Test Loss: 4.183463931083679, LR: 5e-06, Elapsed Time: 2274.29 seconds Step 99400/150000, Loss: 4.123924491405487, Test Loss: 4.1835731863975525, LR: 5e-06, Elapsed Time: 2276.54 seconds Step 99500/150000, Loss: 4.153701310157776, Test Loss: 4.183386862277985, LR: 5e-06, Elapsed Time: 2278.80 seconds Step 99600/150000, Loss: 4.165082664489746, Test Loss: 4.183206021785736, LR: 5e-06, Elapsed Time: 2281.06 seconds Step 99700/150000, Loss: 4.170464701652527, Test Loss: 4.183067500591278, LR: 5e-06, Elapsed Time: 2283.31 seconds Step 99800/150000, Loss: 4.164787769317627, Test Loss: 4.183200180530548, LR: 5e-06, Elapsed Time: 2285.57 seconds Step 99900/150000, Loss: 4.15875910282135, Test Loss: 4.183159351348877, LR: 5e-06, Elapsed Time: 2287.84 seconds Step 100000/150000, Loss: 4.157420461177826, Test Loss: 4.183087050914764, LR: 5e-06, Elapsed Time: 2290.09 seconds Saving model checkpoint at step 100000 Step 100100/150000, Loss: 4.166806497573853, Test Loss: 4.1830713748931885, LR: 5e-06, Elapsed Time: 2292.42 seconds Step 100200/150000, Loss: 4.159164929389954, Test Loss: 4.183219075202942, LR: 5e-06, Elapsed Time: 2294.68 seconds Step 100300/150000, Loss: 4.162308721542359, Test Loss: 4.183288931846619, LR: 5e-06, Elapsed Time: 2296.95 seconds Step 100400/150000, Loss: 4.1558391690254215, Test Loss: 4.183073103427887, LR: 5e-06, 
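The learning rate column in this output is worth a closer look. It steps down from 0.00015 to 4.5e-05 to 1.35e-05 as the test loss flattens out, a factor of 0.3 per drop, and then holds at 5e-06, which is the behavior of a schedule with a lower bound on the learning rate. The snippet below is a minimal sketch of one way to reproduce this pattern with PyTorch's built-in ReduceLROnPlateau scheduler; the factor, patience, and floor values are assumptions read off the log, not confirmed hyperparameters of this training run.

import torch

# Stand-in model and optimizer; in the real run these would be the GPT
# model and its optimizer, with the initial LR seen in the log.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

# Decay the LR by 0.3x whenever the monitored (test) loss stops improving.
# 1.5e-4 * 0.3 = 4.5e-5 and 4.5e-5 * 0.3 = 1.35e-5, matching the log;
# the next drop (4.05e-6) would fall below the floor, so it is clamped
# up to min_lr = 5e-6, which is where the log settles.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",   # lower test loss is better
    factor=0.3,   # multiplicative decay per plateau (inferred from the log)
    patience=10,  # evaluations to wait before decaying (assumed)
    min_lr=5e-6,  # floor matching the final LR in the log
)

# In the training loop, step the scheduler after each evaluation:
#   scheduler.step(test_loss)
#   current_lr = optimizer.param_groups[0]["lr"]  # the value printed above

Once the floor is reached, the loss improves only slowly, and the run continues at LR 5e-06 for the remainder of the log: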
...
Step 110000/150000, Loss: 4.152190790176392, Test Loss: 4.182163655757904, LR: 5e-06, Elapsed Time: 2515.77 seconds
...
Step 120000/150000, Loss: 4.137166090011597, Test Loss: 4.181829333305359, LR: 5e-06, Elapsed Time: 2741.33 seconds
...
Step 121800/150000, Loss: 4.153055021762848, Test Loss: 4.1818326115608215, LR: 5e-06, Elapsed Time: 
2781.96 seconds Step 121900/150000, Loss: 4.159375276565552, Test Loss: 4.181648552417755, LR: 5e-06, Elapsed Time: 2784.22 seconds Step 122000/150000, Loss: 4.163316497802734, Test Loss: 4.18173211812973, LR: 5e-06, Elapsed Time: 2786.47 seconds Step 122100/150000, Loss: 4.153690824508667, Test Loss: 4.181877255439758, LR: 5e-06, Elapsed Time: 2788.72 seconds Step 122200/150000, Loss: 4.157295455932617, Test Loss: 4.181889057159424, LR: 5e-06, Elapsed Time: 2790.98 seconds Step 122300/150000, Loss: 4.159229340553284, Test Loss: 4.181705415248871, LR: 5e-06, Elapsed Time: 2793.23 seconds Step 122400/150000, Loss: 4.156752336025238, Test Loss: 4.1817368268966675, LR: 5e-06, Elapsed Time: 2795.48 seconds Step 122500/150000, Loss: 4.147787728309631, Test Loss: 4.181783080101013, LR: 5e-06, Elapsed Time: 2797.74 seconds Step 122600/150000, Loss: 4.16032422542572, Test Loss: 4.181823253631592, LR: 5e-06, Elapsed Time: 2799.99 seconds Step 122700/150000, Loss: 4.165674891471863, Test Loss: 4.181883990764618, LR: 5e-06, Elapsed Time: 2802.25 seconds Step 122800/150000, Loss: 4.153863697052002, Test Loss: 4.18187552690506, LR: 5e-06, Elapsed Time: 2804.50 seconds Step 122900/150000, Loss: 4.150179584026336, Test Loss: 4.181854486465454, LR: 5e-06, Elapsed Time: 2806.76 seconds Step 123000/150000, Loss: 4.148250744342804, Test Loss: 4.181765377521515, LR: 5e-06, Elapsed Time: 2809.02 seconds Step 123100/150000, Loss: 4.157749433517456, Test Loss: 4.181732773780823, LR: 5e-06, Elapsed Time: 2811.29 seconds Step 123200/150000, Loss: 4.1519507837295535, Test Loss: 4.18162339925766, LR: 5e-06, Elapsed Time: 2813.55 seconds Step 123300/150000, Loss: 4.156062030792237, Test Loss: 4.181559681892395, LR: 5e-06, Elapsed Time: 2815.81 seconds Step 123400/150000, Loss: 4.161894068717957, Test Loss: 4.181565284729004, LR: 5e-06, Elapsed Time: 2818.07 seconds Step 123500/150000, Loss: 4.1506724405288695, Test Loss: 4.181475520133972, LR: 5e-06, Elapsed Time: 2820.34 seconds Step 123600/150000, Loss: 4.152840025424958, Test Loss: 4.181672990322113, LR: 5e-06, Elapsed Time: 2822.59 seconds Step 123700/150000, Loss: 4.156029741764069, Test Loss: 4.181765556335449, LR: 5e-06, Elapsed Time: 2824.85 seconds Step 123800/150000, Loss: 4.1599409294128415, Test Loss: 4.181640803813934, LR: 5e-06, Elapsed Time: 2827.12 seconds Step 123900/150000, Loss: 4.146530921459198, Test Loss: 4.181649565696716, LR: 5e-06, Elapsed Time: 2829.37 seconds Step 124000/150000, Loss: 4.152094464302063, Test Loss: 4.181772589683533, LR: 5e-06, Elapsed Time: 2831.64 seconds Step 124100/150000, Loss: 4.142446751594544, Test Loss: 4.181711852550507, LR: 5e-06, Elapsed Time: 2833.89 seconds Step 124200/150000, Loss: 4.148900952339172, Test Loss: 4.181776404380798, LR: 5e-06, Elapsed Time: 2836.14 seconds Step 124300/150000, Loss: 4.156900997161865, Test Loss: 4.1817467212677, LR: 5e-06, Elapsed Time: 2838.40 seconds Step 124400/150000, Loss: 4.148980157375336, Test Loss: 4.181779265403748, LR: 5e-06, Elapsed Time: 2840.66 seconds Step 124500/150000, Loss: 4.150394163131714, Test Loss: 4.181774258613586, LR: 5e-06, Elapsed Time: 2842.92 seconds Step 124600/150000, Loss: 4.154425678253173, Test Loss: 4.181757986545563, LR: 5e-06, Elapsed Time: 2845.17 seconds Step 124700/150000, Loss: 4.139901387691498, Test Loss: 4.181874454021454, LR: 5e-06, Elapsed Time: 2847.44 seconds Step 124800/150000, Loss: 4.15435087442398, Test Loss: 4.181739687919617, LR: 5e-06, Elapsed Time: 2849.68 seconds Step 124900/150000, Loss: 4.156291320323944, Test Loss: 
4.1817198395729065, LR: 5e-06, Elapsed Time: 2851.94 seconds Step 125000/150000, Loss: 4.1428249549865725, Test Loss: 4.181617915630341, LR: 5e-06, Elapsed Time: 2854.19 seconds Step 125100/150000, Loss: 4.161484522819519, Test Loss: 4.181633830070496, LR: 5e-06, Elapsed Time: 2856.44 seconds Step 125200/150000, Loss: 4.150545539855957, Test Loss: 4.1815988421440125, LR: 5e-06, Elapsed Time: 2858.69 seconds Step 125300/150000, Loss: 4.138963198661804, Test Loss: 4.181654214859009, LR: 5e-06, Elapsed Time: 2860.95 seconds Step 125400/150000, Loss: 4.14000364780426, Test Loss: 4.181680500507355, LR: 5e-06, Elapsed Time: 2863.20 seconds Step 125500/150000, Loss: 4.157096199989319, Test Loss: 4.181566953659058, LR: 5e-06, Elapsed Time: 2865.45 seconds Step 125600/150000, Loss: 4.146410193443298, Test Loss: 4.181634366512299, LR: 5e-06, Elapsed Time: 2867.71 seconds Step 125700/150000, Loss: 4.152270107269287, Test Loss: 4.181476593017578, LR: 5e-06, Elapsed Time: 2869.96 seconds Step 125800/150000, Loss: 4.151828274726868, Test Loss: 4.181437730789185, LR: 5e-06, Elapsed Time: 2872.22 seconds Step 125900/150000, Loss: 4.14664783000946, Test Loss: 4.181487798690796, LR: 5e-06, Elapsed Time: 2874.47 seconds Step 126000/150000, Loss: 4.155303215980529, Test Loss: 4.181423366069794, LR: 5e-06, Elapsed Time: 2876.73 seconds Step 126100/150000, Loss: 4.149375920295715, Test Loss: 4.1813507080078125, LR: 5e-06, Elapsed Time: 2878.98 seconds Step 126200/150000, Loss: 4.157417650222778, Test Loss: 4.181538641452789, LR: 5e-06, Elapsed Time: 2881.24 seconds Step 126300/150000, Loss: 4.14254980802536, Test Loss: 4.181646227836609, LR: 5e-06, Elapsed Time: 2883.50 seconds Step 126400/150000, Loss: 4.140755379199982, Test Loss: 4.181546211242676, LR: 5e-06, Elapsed Time: 2885.75 seconds Step 126500/150000, Loss: 4.1530619668960576, Test Loss: 4.1815454959869385, LR: 5e-06, Elapsed Time: 2888.01 seconds Step 126600/150000, Loss: 4.145195422172546, Test Loss: 4.181601047515869, LR: 5e-06, Elapsed Time: 2890.27 seconds Step 126700/150000, Loss: 4.158602879047394, Test Loss: 4.181508779525757, LR: 5e-06, Elapsed Time: 2892.53 seconds Step 126800/150000, Loss: 4.145369207859039, Test Loss: 4.181495845317841, LR: 5e-06, Elapsed Time: 2894.79 seconds Step 126900/150000, Loss: 4.131147680282592, Test Loss: 4.181473135948181, LR: 5e-06, Elapsed Time: 2897.05 seconds Step 127000/150000, Loss: 4.136784770488739, Test Loss: 4.181524276733398, LR: 5e-06, Elapsed Time: 2899.31 seconds Step 127100/150000, Loss: 4.148173480033875, Test Loss: 4.181537389755249, LR: 5e-06, Elapsed Time: 2901.57 seconds Step 127200/150000, Loss: 4.154543423652649, Test Loss: 4.181467592716217, LR: 5e-06, Elapsed Time: 2903.83 seconds Step 127300/150000, Loss: 4.141052598953247, Test Loss: 4.181428015232086, LR: 5e-06, Elapsed Time: 2906.09 seconds Step 127400/150000, Loss: 4.150248193740845, Test Loss: 4.181328594684601, LR: 5e-06, Elapsed Time: 2908.34 seconds Step 127500/150000, Loss: 4.143479623794556, Test Loss: 4.181455135345459, LR: 5e-06, Elapsed Time: 2910.60 seconds Step 127600/150000, Loss: 4.142916204929352, Test Loss: 4.1814024448394775, LR: 5e-06, Elapsed Time: 2912.85 seconds Step 127700/150000, Loss: 4.135766248703003, Test Loss: 4.1814568638801575, LR: 5e-06, Elapsed Time: 2915.11 seconds Step 127800/150000, Loss: 4.142787203788758, Test Loss: 4.1815338134765625, LR: 5e-06, Elapsed Time: 2917.37 seconds Step 127900/150000, Loss: 4.153836424350739, Test Loss: 4.181381344795227, LR: 5e-06, Elapsed Time: 2919.62 seconds Step 
128000/150000, Loss: 4.150507664680481, Test Loss: 4.181241989135742, LR: 5e-06, Elapsed Time: 2921.88 seconds Step 128100/150000, Loss: 4.145661752223969, Test Loss: 4.1812968254089355, LR: 5e-06, Elapsed Time: 2924.14 seconds Step 128200/150000, Loss: 4.1499343729019165, Test Loss: 4.181242823600769, LR: 5e-06, Elapsed Time: 2926.40 seconds Step 128300/150000, Loss: 4.14426750421524, Test Loss: 4.1813793778419495, LR: 5e-06, Elapsed Time: 2928.67 seconds Step 128400/150000, Loss: 4.1366752576828, Test Loss: 4.181293487548828, LR: 5e-06, Elapsed Time: 2930.92 seconds Step 128500/150000, Loss: 4.140944304466248, Test Loss: 4.181335091590881, LR: 5e-06, Elapsed Time: 2933.19 seconds Step 128600/150000, Loss: 4.137900934219361, Test Loss: 4.181239545345306, LR: 5e-06, Elapsed Time: 2935.44 seconds Step 128700/150000, Loss: 4.138459091186523, Test Loss: 4.181259989738464, LR: 5e-06, Elapsed Time: 2937.70 seconds Step 128800/150000, Loss: 4.142805697917939, Test Loss: 4.18126517534256, LR: 5e-06, Elapsed Time: 2939.96 seconds Step 128900/150000, Loss: 4.141567413806915, Test Loss: 4.18113374710083, LR: 5e-06, Elapsed Time: 2942.22 seconds Step 129000/150000, Loss: 4.136330661773681, Test Loss: 4.1813578605651855, LR: 5e-06, Elapsed Time: 2944.47 seconds Step 129100/150000, Loss: 4.138098182678223, Test Loss: 4.18156772851944, LR: 5e-06, Elapsed Time: 2946.72 seconds Step 129200/150000, Loss: 4.153723649978637, Test Loss: 4.1814223527908325, LR: 5e-06, Elapsed Time: 2948.98 seconds Step 129300/150000, Loss: 4.161256446838379, Test Loss: 4.1814181208610535, LR: 5e-06, Elapsed Time: 2951.25 seconds Step 129400/150000, Loss: 4.165924334526062, Test Loss: 4.181452631950378, LR: 5e-06, Elapsed Time: 2953.51 seconds Step 129500/150000, Loss: 4.150084707736969, Test Loss: 4.181361019611359, LR: 5e-06, Elapsed Time: 2955.77 seconds Step 129600/150000, Loss: 4.150240681171417, Test Loss: 4.181359350681305, LR: 5e-06, Elapsed Time: 2958.03 seconds Step 129700/150000, Loss: 4.157313137054444, Test Loss: 4.181387841701508, LR: 5e-06, Elapsed Time: 2960.29 seconds Step 129800/150000, Loss: 4.1555680084228515, Test Loss: 4.1813108921051025, LR: 5e-06, Elapsed Time: 2962.55 seconds Step 129900/150000, Loss: 4.159912166595459, Test Loss: 4.181372821331024, LR: 5e-06, Elapsed Time: 2964.81 seconds Step 130000/150000, Loss: 4.151067097187042, Test Loss: 4.181275010108948, LR: 5e-06, Elapsed Time: 2967.07 seconds Step 130100/150000, Loss: 4.160516471862793, Test Loss: 4.181325018405914, LR: 5e-06, Elapsed Time: 2969.33 seconds Step 130200/150000, Loss: 4.14968774318695, Test Loss: 4.1813576221466064, LR: 5e-06, Elapsed Time: 2971.58 seconds Step 130300/150000, Loss: 4.150251178741455, Test Loss: 4.181163787841797, LR: 5e-06, Elapsed Time: 2973.85 seconds Step 130400/150000, Loss: 4.1699363899230955, Test Loss: 4.181101500988007, LR: 5e-06, Elapsed Time: 2976.11 seconds Step 130500/150000, Loss: 4.157500033378601, Test Loss: 4.18120151758194, LR: 5e-06, Elapsed Time: 2978.36 seconds Step 130600/150000, Loss: 4.166793093681336, Test Loss: 4.181151330471039, LR: 5e-06, Elapsed Time: 2980.62 seconds Step 130700/150000, Loss: 4.16063283443451, Test Loss: 4.181126058101654, LR: 5e-06, Elapsed Time: 2982.87 seconds Step 130800/150000, Loss: 4.161303510665894, Test Loss: 4.181246042251587, LR: 5e-06, Elapsed Time: 2985.13 seconds Step 130900/150000, Loss: 4.164380402565002, Test Loss: 4.181098163127899, LR: 5e-06, Elapsed Time: 2987.38 seconds Step 131000/150000, Loss: 4.151716828346252, Test Loss: 4.1813271045684814, 
LR: 5e-06, Elapsed Time: 2989.63 seconds Step 131100/150000, Loss: 4.159277296066284, Test Loss: 4.181358873844147, LR: 5e-06, Elapsed Time: 2991.89 seconds Step 131200/150000, Loss: 4.1548303031921385, Test Loss: 4.181319355964661, LR: 5e-06, Elapsed Time: 2994.15 seconds Step 131300/150000, Loss: 4.154317810535431, Test Loss: 4.181254863739014, LR: 5e-06, Elapsed Time: 2996.40 seconds Step 131400/150000, Loss: 4.152444107532501, Test Loss: 4.1810285449028015, LR: 5e-06, Elapsed Time: 2998.65 seconds Step 131500/150000, Loss: 4.154593987464905, Test Loss: 4.180910587310791, LR: 5e-06, Elapsed Time: 3000.91 seconds Step 131600/150000, Loss: 4.150109543800354, Test Loss: 4.180918455123901, LR: 5e-06, Elapsed Time: 3003.17 seconds Step 131700/150000, Loss: 4.156760025024414, Test Loss: 4.180985331535339, LR: 5e-06, Elapsed Time: 3005.42 seconds Step 131800/150000, Loss: 4.153700911998749, Test Loss: 4.180987775325775, LR: 5e-06, Elapsed Time: 3007.69 seconds Step 131900/150000, Loss: 4.155892918109894, Test Loss: 4.180764675140381, LR: 5e-06, Elapsed Time: 3009.95 seconds Step 132000/150000, Loss: 4.156841316223145, Test Loss: 4.181000053882599, LR: 5e-06, Elapsed Time: 3012.20 seconds Step 132100/150000, Loss: 4.1526202297210695, Test Loss: 4.181062877178192, LR: 5e-06, Elapsed Time: 3014.46 seconds Step 132200/150000, Loss: 4.1479080867767335, Test Loss: 4.1810256242752075, LR: 5e-06, Elapsed Time: 3016.72 seconds Step 132300/150000, Loss: 4.132697849273682, Test Loss: 4.181207120418549, LR: 5e-06, Elapsed Time: 3018.97 seconds Step 132400/150000, Loss: 4.152479288578033, Test Loss: 4.181059300899506, LR: 5e-06, Elapsed Time: 3021.23 seconds Step 132500/150000, Loss: 4.149134273529053, Test Loss: 4.1810537576675415, LR: 5e-06, Elapsed Time: 3023.48 seconds Step 132600/150000, Loss: 4.142528357505799, Test Loss: 4.181151211261749, LR: 5e-06, Elapsed Time: 3025.73 seconds Step 132700/150000, Loss: 4.1589872980117795, Test Loss: 4.180977702140808, LR: 5e-06, Elapsed Time: 3027.99 seconds Step 132800/150000, Loss: 4.149524872303009, Test Loss: 4.1809868812561035, LR: 5e-06, Elapsed Time: 3030.24 seconds Step 132900/150000, Loss: 4.154537162780762, Test Loss: 4.18122124671936, LR: 5e-06, Elapsed Time: 3032.50 seconds Step 133000/150000, Loss: 4.158372926712036, Test Loss: 4.181107938289642, LR: 5e-06, Elapsed Time: 3034.76 seconds Step 133100/150000, Loss: 4.145253925323487, Test Loss: 4.1810309290885925, LR: 5e-06, Elapsed Time: 3037.02 seconds Step 133200/150000, Loss: 4.1399498462677, Test Loss: 4.181110739707947, LR: 5e-06, Elapsed Time: 3039.27 seconds Step 133300/150000, Loss: 4.16040673494339, Test Loss: 4.180877983570099, LR: 5e-06, Elapsed Time: 3041.53 seconds Step 133400/150000, Loss: 4.14818953037262, Test Loss: 4.1810103058815, LR: 5e-06, Elapsed Time: 3043.78 seconds Step 133500/150000, Loss: 4.146742291450501, Test Loss: 4.180799186229706, LR: 5e-06, Elapsed Time: 3046.03 seconds Step 133600/150000, Loss: 4.155557007789612, Test Loss: 4.180810868740082, LR: 5e-06, Elapsed Time: 3048.29 seconds Step 133700/150000, Loss: 4.150594158172607, Test Loss: 4.180923640727997, LR: 5e-06, Elapsed Time: 3050.54 seconds Step 133800/150000, Loss: 4.145835716724395, Test Loss: 4.1808470487594604, LR: 5e-06, Elapsed Time: 3052.80 seconds Step 133900/150000, Loss: 4.144240927696228, Test Loss: 4.180925488471985, LR: 5e-06, Elapsed Time: 3055.05 seconds Step 134000/150000, Loss: 4.143832082748413, Test Loss: 4.18086576461792, LR: 5e-06, Elapsed Time: 3057.31 seconds Step 134100/150000, Loss: 
4.143670716285706, Test Loss: 4.180878460407257, LR: 5e-06, Elapsed Time: 3059.56 seconds Step 134200/150000, Loss: 4.140794360637665, Test Loss: 4.180964112281799, LR: 5e-06, Elapsed Time: 3061.81 seconds Step 134300/150000, Loss: 4.1404292154312134, Test Loss: 4.180882215499878, LR: 5e-06, Elapsed Time: 3064.06 seconds Step 134400/150000, Loss: 4.148724758625031, Test Loss: 4.180740237236023, LR: 5e-06, Elapsed Time: 3066.32 seconds Step 134500/150000, Loss: 4.1556789541244505, Test Loss: 4.18079686164856, LR: 5e-06, Elapsed Time: 3068.56 seconds Step 134600/150000, Loss: 4.149005475044251, Test Loss: 4.180815100669861, LR: 5e-06, Elapsed Time: 3070.82 seconds Step 134700/150000, Loss: 4.161994409561157, Test Loss: 4.180668950080872, LR: 5e-06, Elapsed Time: 3073.07 seconds Step 134800/150000, Loss: 4.1566725969314575, Test Loss: 4.18065619468689, LR: 5e-06, Elapsed Time: 3075.33 seconds Step 134900/150000, Loss: 4.152757797241211, Test Loss: 4.1806018352508545, LR: 5e-06, Elapsed Time: 3077.59 seconds Step 135000/150000, Loss: 4.155890913009643, Test Loss: 4.1806640625, LR: 5e-06, Elapsed Time: 3079.84 seconds Step 135100/150000, Loss: 4.1620371055603025, Test Loss: 4.180613458156586, LR: 5e-06, Elapsed Time: 3082.10 seconds Step 135200/150000, Loss: 4.150667989253998, Test Loss: 4.180708050727844, LR: 5e-06, Elapsed Time: 3084.35 seconds Step 135300/150000, Loss: 4.158370022773743, Test Loss: 4.180633962154388, LR: 5e-06, Elapsed Time: 3086.61 seconds Step 135400/150000, Loss: 4.148505845069885, Test Loss: 4.1807741522789, LR: 5e-06, Elapsed Time: 3088.86 seconds Step 135500/150000, Loss: 4.148550260066986, Test Loss: 4.180688500404358, LR: 5e-06, Elapsed Time: 3091.11 seconds Step 135600/150000, Loss: 4.1600560235977175, Test Loss: 4.1806416511535645, LR: 5e-06, Elapsed Time: 3093.37 seconds Step 135700/150000, Loss: 4.142225279808044, Test Loss: 4.180648863315582, LR: 5e-06, Elapsed Time: 3095.62 seconds Step 135800/150000, Loss: 4.157450475692749, Test Loss: 4.180757164955139, LR: 5e-06, Elapsed Time: 3097.88 seconds Step 135900/150000, Loss: 4.15175096988678, Test Loss: 4.180720567703247, LR: 5e-06, Elapsed Time: 3100.14 seconds Step 136000/150000, Loss: 4.155349378585815, Test Loss: 4.180803835391998, LR: 5e-06, Elapsed Time: 3102.40 seconds Step 136100/150000, Loss: 4.155440516471863, Test Loss: 4.1807039976119995, LR: 5e-06, Elapsed Time: 3104.65 seconds Step 136200/150000, Loss: 4.157934260368347, Test Loss: 4.18072122335434, LR: 5e-06, Elapsed Time: 3106.91 seconds Step 136300/150000, Loss: 4.139308223724365, Test Loss: 4.180696070194244, LR: 5e-06, Elapsed Time: 3109.16 seconds Step 136400/150000, Loss: 4.149413270950317, Test Loss: 4.18073034286499, LR: 5e-06, Elapsed Time: 3111.42 seconds Step 136500/150000, Loss: 4.149836130142212, Test Loss: 4.180678188800812, LR: 5e-06, Elapsed Time: 3113.68 seconds Step 136600/150000, Loss: 4.147969222068786, Test Loss: 4.180661857128143, LR: 5e-06, Elapsed Time: 3115.94 seconds Step 136700/150000, Loss: 4.153535614013672, Test Loss: 4.180656909942627, LR: 5e-06, Elapsed Time: 3118.19 seconds Step 136800/150000, Loss: 4.149936199188232, Test Loss: 4.180674254894257, LR: 5e-06, Elapsed Time: 3120.45 seconds Step 136900/150000, Loss: 4.145710363388061, Test Loss: 4.180688798427582, LR: 5e-06, Elapsed Time: 3122.70 seconds Step 137000/150000, Loss: 4.1526238322258, Test Loss: 4.1806952357292175, LR: 5e-06, Elapsed Time: 3124.96 seconds Step 137100/150000, Loss: 4.140164530277252, Test Loss: 4.180743753910065, LR: 5e-06, Elapsed Time: 
3127.22 seconds Step 137200/150000, Loss: 4.154197213649749, Test Loss: 4.18074768781662, LR: 5e-06, Elapsed Time: 3129.47 seconds Step 137300/150000, Loss: 4.1588286519050595, Test Loss: 4.1807573437690735, LR: 5e-06, Elapsed Time: 3131.73 seconds Step 137400/150000, Loss: 4.146291317939759, Test Loss: 4.180715978145599, LR: 5e-06, Elapsed Time: 3133.98 seconds Step 137500/150000, Loss: 4.142855167388916, Test Loss: 4.180676221847534, LR: 5e-06, Elapsed Time: 3136.23 seconds Step 137600/150000, Loss: 4.151567249298096, Test Loss: 4.180994093418121, LR: 5e-06, Elapsed Time: 3138.49 seconds Step 137700/150000, Loss: 4.148460640907287, Test Loss: 4.180766224861145, LR: 5e-06, Elapsed Time: 3140.74 seconds Step 137800/150000, Loss: 4.1576125431060795, Test Loss: 4.180857837200165, LR: 5e-06, Elapsed Time: 3143.01 seconds Step 137900/150000, Loss: 4.148126668930054, Test Loss: 4.180827796459198, LR: 5e-06, Elapsed Time: 3145.26 seconds Step 138000/150000, Loss: 4.144606010913849, Test Loss: 4.180875599384308, LR: 5e-06, Elapsed Time: 3147.51 seconds Step 138100/150000, Loss: 4.142617239952087, Test Loss: 4.180846452713013, LR: 5e-06, Elapsed Time: 3149.77 seconds Step 138200/150000, Loss: 4.145609557628632, Test Loss: 4.180811822414398, LR: 5e-06, Elapsed Time: 3152.03 seconds Step 138300/150000, Loss: 4.145351595878601, Test Loss: 4.180744528770447, LR: 5e-06, Elapsed Time: 3154.28 seconds Step 138400/150000, Loss: 4.143880844116211, Test Loss: 4.1807814836502075, LR: 5e-06, Elapsed Time: 3156.54 seconds Step 138500/150000, Loss: 4.14545716047287, Test Loss: 4.180710971355438, LR: 5e-06, Elapsed Time: 3158.79 seconds Step 138600/150000, Loss: 4.135269021987915, Test Loss: 4.180614590644836, LR: 5e-06, Elapsed Time: 3161.05 seconds Step 138700/150000, Loss: 4.156927795410156, Test Loss: 4.180579125881195, LR: 5e-06, Elapsed Time: 3163.30 seconds Step 138800/150000, Loss: 4.134905803203583, Test Loss: 4.180674970149994, LR: 5e-06, Elapsed Time: 3165.55 seconds Step 138900/150000, Loss: 4.145070972442627, Test Loss: 4.180669188499451, LR: 5e-06, Elapsed Time: 3167.81 seconds Step 139000/150000, Loss: 4.148415899276733, Test Loss: 4.18072384595871, LR: 5e-06, Elapsed Time: 3170.07 seconds Step 139100/150000, Loss: 4.135412149429321, Test Loss: 4.180743217468262, LR: 5e-06, Elapsed Time: 3172.33 seconds Step 139200/150000, Loss: 4.141506609916687, Test Loss: 4.180526077747345, LR: 5e-06, Elapsed Time: 3174.58 seconds Step 139300/150000, Loss: 4.136495673656464, Test Loss: 4.180550515651703, LR: 5e-06, Elapsed Time: 3176.84 seconds Step 139400/150000, Loss: 4.146461436748504, Test Loss: 4.180645108222961, LR: 5e-06, Elapsed Time: 3179.10 seconds Step 139500/150000, Loss: 4.140581934452057, Test Loss: 4.180588781833649, LR: 5e-06, Elapsed Time: 3181.36 seconds Step 139600/150000, Loss: 4.151443638801575, Test Loss: 4.1805760860443115, LR: 5e-06, Elapsed Time: 3183.61 seconds Step 139700/150000, Loss: 4.136056385040283, Test Loss: 4.180652916431427, LR: 5e-06, Elapsed Time: 3185.87 seconds Step 139800/150000, Loss: 4.146726565361023, Test Loss: 4.180597305297852, LR: 5e-06, Elapsed Time: 3188.13 seconds Step 139900/150000, Loss: 4.139076569080353, Test Loss: 4.180607259273529, LR: 5e-06, Elapsed Time: 3190.38 seconds Step 140000/150000, Loss: 4.134179356098175, Test Loss: 4.180575907230377, LR: 5e-06, Elapsed Time: 3192.63 seconds Step 140100/150000, Loss: 4.138518481254578, Test Loss: 4.180704176425934, LR: 5e-06, Elapsed Time: 3194.88 seconds Step 140200/150000, Loss: 4.145107164382934, Test Loss: 
4.1807790994644165, LR: 5e-06, Elapsed Time: 3197.14 seconds Step 140300/150000, Loss: 4.1344160962104795, Test Loss: 4.180650591850281, LR: 5e-06, Elapsed Time: 3199.40 seconds Step 140400/150000, Loss: 4.144437236785889, Test Loss: 4.180719792842865, LR: 5e-06, Elapsed Time: 3201.65 seconds Step 140500/150000, Loss: 4.147898693084716, Test Loss: 4.180627763271332, LR: 5e-06, Elapsed Time: 3203.90 seconds Step 140600/150000, Loss: 4.136467070579529, Test Loss: 4.180700957775116, LR: 5e-06, Elapsed Time: 3206.16 seconds Step 140700/150000, Loss: 4.123574466705322, Test Loss: 4.180872619152069, LR: 5e-06, Elapsed Time: 3208.41 seconds Step 140800/150000, Loss: 4.1416349649429325, Test Loss: 4.180676102638245, LR: 5e-06, Elapsed Time: 3210.66 seconds Step 140900/150000, Loss: 4.1361446905136106, Test Loss: 4.18076229095459, LR: 5e-06, Elapsed Time: 3212.92 seconds Step 141000/150000, Loss: 4.140163555145263, Test Loss: 4.180735766887665, LR: 5e-06, Elapsed Time: 3215.17 seconds Step 141100/150000, Loss: 4.130803995132446, Test Loss: 4.180750012397766, LR: 5e-06, Elapsed Time: 3217.43 seconds Step 141200/150000, Loss: 4.129034531116486, Test Loss: 4.180616140365601, LR: 5e-06, Elapsed Time: 3219.68 seconds Step 141300/150000, Loss: 4.135971484184265, Test Loss: 4.180709600448608, LR: 5e-06, Elapsed Time: 3221.94 seconds Step 141400/150000, Loss: 4.138126845359802, Test Loss: 4.180687785148621, LR: 5e-06, Elapsed Time: 3224.20 seconds Step 141500/150000, Loss: 4.147747678756714, Test Loss: 4.1807098388671875, LR: 5e-06, Elapsed Time: 3226.45 seconds Step 141600/150000, Loss: 4.130188779830933, Test Loss: 4.180795609951019, LR: 5e-06, Elapsed Time: 3228.71 seconds Step 141700/150000, Loss: 4.1367925262451175, Test Loss: 4.180750787258148, LR: 5e-06, Elapsed Time: 3230.96 seconds Step 141800/150000, Loss: 4.142098579406738, Test Loss: 4.180681765079498, LR: 5e-06, Elapsed Time: 3233.22 seconds Step 141900/150000, Loss: 4.136783003807068, Test Loss: 4.180688142776489, LR: 5e-06, Elapsed Time: 3235.47 seconds Step 142000/150000, Loss: 4.134648907184601, Test Loss: 4.180769979953766, LR: 5e-06, Elapsed Time: 3237.72 seconds Step 142100/150000, Loss: 4.136258013248444, Test Loss: 4.180747926235199, LR: 5e-06, Elapsed Time: 3239.97 seconds Step 142200/150000, Loss: 4.148324015140534, Test Loss: 4.180603086948395, LR: 5e-06, Elapsed Time: 3242.22 seconds Step 142300/150000, Loss: 4.126793384552002, Test Loss: 4.180701911449432, LR: 5e-06, Elapsed Time: 3244.46 seconds Step 142400/150000, Loss: 4.130730347633362, Test Loss: 4.180586993694305, LR: 5e-06, Elapsed Time: 3246.71 seconds Step 142500/150000, Loss: 4.1303426551818845, Test Loss: 4.180789530277252, LR: 5e-06, Elapsed Time: 3248.97 seconds Step 142600/150000, Loss: 4.128604137897492, Test Loss: 4.180763602256775, LR: 5e-06, Elapsed Time: 3251.23 seconds Step 142700/150000, Loss: 4.130739464759826, Test Loss: 4.180921673774719, LR: 5e-06, Elapsed Time: 3253.48 seconds Step 142800/150000, Loss: 4.134466123580933, Test Loss: 4.180934131145477, LR: 5e-06, Elapsed Time: 3255.72 seconds Step 142900/150000, Loss: 4.136259725093842, Test Loss: 4.180843830108643, LR: 5e-06, Elapsed Time: 3257.98 seconds Step 143000/150000, Loss: 4.1253086376190184, Test Loss: 4.180842876434326, LR: 5e-06, Elapsed Time: 3260.23 seconds Step 143100/150000, Loss: 4.125257115364075, Test Loss: 4.180930733680725, LR: 5e-06, Elapsed Time: 3262.49 seconds Step 143200/150000, Loss: 4.130850279331208, Test Loss: 4.180885493755341, LR: 5e-06, Elapsed Time: 3264.74 seconds Step 
143300/150000, Loss: 4.162068078517914, Test Loss: 4.1806520819664, LR: 5e-06, Elapsed Time: 3266.99 seconds Step 143400/150000, Loss: 4.165802745819092, Test Loss: 4.180557191371918, LR: 5e-06, Elapsed Time: 3269.25 seconds Step 143500/150000, Loss: 4.161580121517181, Test Loss: 4.180514812469482, LR: 5e-06, Elapsed Time: 3271.51 seconds Step 143600/150000, Loss: 4.161072676181793, Test Loss: 4.180498957633972, LR: 5e-06, Elapsed Time: 3273.76 seconds Step 143700/150000, Loss: 4.148600263595581, Test Loss: 4.180596709251404, LR: 5e-06, Elapsed Time: 3276.01 seconds Step 143800/150000, Loss: 4.159546422958374, Test Loss: 4.180392622947693, LR: 5e-06, Elapsed Time: 3278.26 seconds Step 143900/150000, Loss: 4.161140251159668, Test Loss: 4.180577218532562, LR: 5e-06, Elapsed Time: 3280.51 seconds Step 144000/150000, Loss: 4.1581151485443115, Test Loss: 4.180643200874329, LR: 5e-06, Elapsed Time: 3282.77 seconds Step 144100/150000, Loss: 4.153283135890961, Test Loss: 4.180673182010651, LR: 5e-06, Elapsed Time: 3285.03 seconds Step 144200/150000, Loss: 4.159139046669006, Test Loss: 4.180613815784454, LR: 5e-06, Elapsed Time: 3287.28 seconds Step 144300/150000, Loss: 4.149131350517273, Test Loss: 4.1805219650268555, LR: 5e-06, Elapsed Time: 3289.53 seconds Step 144400/150000, Loss: 4.149352078437805, Test Loss: 4.180687367916107, LR: 5e-06, Elapsed Time: 3291.79 seconds Step 144500/150000, Loss: 4.156682364940643, Test Loss: 4.180707037448883, LR: 5e-06, Elapsed Time: 3294.04 seconds Step 144600/150000, Loss: 4.166525187492371, Test Loss: 4.180635154247284, LR: 5e-06, Elapsed Time: 3296.30 seconds Step 144700/150000, Loss: 4.154554615020752, Test Loss: 4.180738031864166, LR: 5e-06, Elapsed Time: 3298.56 seconds Step 144800/150000, Loss: 4.14352961063385, Test Loss: 4.180616021156311, LR: 5e-06, Elapsed Time: 3300.82 seconds Step 144900/150000, Loss: 4.151920824050904, Test Loss: 4.180610001087189, LR: 5e-06, Elapsed Time: 3303.08 seconds Step 145000/150000, Loss: 4.148179974555969, Test Loss: 4.180620789527893, LR: 5e-06, Elapsed Time: 3305.33 seconds Step 145100/150000, Loss: 4.155771980285644, Test Loss: 4.18039208650589, LR: 5e-06, Elapsed Time: 3307.58 seconds Step 145200/150000, Loss: 4.154883942604065, Test Loss: 4.180370271205902, LR: 5e-06, Elapsed Time: 3309.83 seconds Step 145300/150000, Loss: 4.154076066017151, Test Loss: 4.1803149580955505, LR: 5e-06, Elapsed Time: 3312.08 seconds Step 145400/150000, Loss: 4.1558993887901305, Test Loss: 4.180402517318726, LR: 5e-06, Elapsed Time: 3314.33 seconds Step 145500/150000, Loss: 4.1507432389259336, Test Loss: 4.180459439754486, LR: 5e-06, Elapsed Time: 3316.59 seconds Step 145600/150000, Loss: 4.14922342300415, Test Loss: 4.180556654930115, LR: 5e-06, Elapsed Time: 3318.84 seconds Step 145700/150000, Loss: 4.164098992347717, Test Loss: 4.180440068244934, LR: 5e-06, Elapsed Time: 3321.10 seconds Step 145800/150000, Loss: 4.145811583995819, Test Loss: 4.18051677942276, LR: 5e-06, Elapsed Time: 3323.36 seconds Step 145900/150000, Loss: 4.1386399078369145, Test Loss: 4.180556654930115, LR: 5e-06, Elapsed Time: 3325.61 seconds Step 146000/150000, Loss: 4.148672757148742, Test Loss: 4.180585861206055, LR: 5e-06, Elapsed Time: 3327.87 seconds Step 146100/150000, Loss: 4.151382846832275, Test Loss: 4.180614411830902, LR: 5e-06, Elapsed Time: 3330.12 seconds Step 146200/150000, Loss: 4.1468266272544865, Test Loss: 4.180597722530365, LR: 5e-06, Elapsed Time: 3332.38 seconds Step 146300/150000, Loss: 4.153343448638916, Test Loss: 4.180604457855225, LR: 
5e-06, Elapsed Time: 3334.63 seconds Step 146400/150000, Loss: 4.1464552354812625, Test Loss: 4.180599570274353, LR: 5e-06, Elapsed Time: 3336.88 seconds Step 146500/150000, Loss: 4.146523184776306, Test Loss: 4.1806159019470215, LR: 5e-06, Elapsed Time: 3339.13 seconds Step 146600/150000, Loss: 4.140141241550445, Test Loss: 4.180782318115234, LR: 5e-06, Elapsed Time: 3341.38 seconds Step 146700/150000, Loss: 4.163375818729401, Test Loss: 4.18051290512085, LR: 5e-06, Elapsed Time: 3343.63 seconds Step 146800/150000, Loss: 4.148511486053467, Test Loss: 4.180515646934509, LR: 5e-06, Elapsed Time: 3345.89 seconds Step 146900/150000, Loss: 4.147431492805481, Test Loss: 4.180420696735382, LR: 5e-06, Elapsed Time: 3348.14 seconds Step 147000/150000, Loss: 4.15636759519577, Test Loss: 4.180519163608551, LR: 5e-06, Elapsed Time: 3350.40 seconds Step 147100/150000, Loss: 4.148008179664612, Test Loss: 4.180407285690308, LR: 5e-06, Elapsed Time: 3352.65 seconds Step 147200/150000, Loss: 4.134121441841126, Test Loss: 4.180474638938904, LR: 5e-06, Elapsed Time: 3354.91 seconds Step 147300/150000, Loss: 4.145106959342956, Test Loss: 4.180422842502594, LR: 5e-06, Elapsed Time: 3357.16 seconds Step 147400/150000, Loss: 4.1536610579490665, Test Loss: 4.180434823036194, LR: 5e-06, Elapsed Time: 3359.41 seconds Step 147500/150000, Loss: 4.141767630577087, Test Loss: 4.180455923080444, LR: 5e-06, Elapsed Time: 3361.67 seconds Step 147600/150000, Loss: 4.157983636856079, Test Loss: 4.180297672748566, LR: 5e-06, Elapsed Time: 3363.92 seconds Step 147700/150000, Loss: 4.14756781578064, Test Loss: 4.180255055427551, LR: 5e-06, Elapsed Time: 3366.18 seconds Step 147800/150000, Loss: 4.143155150413513, Test Loss: 4.180320560932159, LR: 5e-06, Elapsed Time: 3368.43 seconds Step 147900/150000, Loss: 4.153578977584839, Test Loss: 4.180245220661163, LR: 5e-06, Elapsed Time: 3370.69 seconds Step 148000/150000, Loss: 4.155347971916199, Test Loss: 4.180192351341248, LR: 5e-06, Elapsed Time: 3372.94 seconds Step 148100/150000, Loss: 4.150769820213318, Test Loss: 4.180369675159454, LR: 5e-06, Elapsed Time: 3375.18 seconds Step 148200/150000, Loss: 4.141762051582337, Test Loss: 4.1805577874183655, LR: 5e-06, Elapsed Time: 3377.44 seconds Step 148300/150000, Loss: 4.139307117462158, Test Loss: 4.180364668369293, LR: 5e-06, Elapsed Time: 3379.69 seconds Step 148400/150000, Loss: 4.149774701595306, Test Loss: 4.180434763431549, LR: 5e-06, Elapsed Time: 3381.95 seconds Step 148500/150000, Loss: 4.14942238330841, Test Loss: 4.1803507804870605, LR: 5e-06, Elapsed Time: 3384.19 seconds Step 148600/150000, Loss: 4.160615088939667, Test Loss: 4.180263042449951, LR: 5e-06, Elapsed Time: 3386.44 seconds Step 148700/150000, Loss: 4.13036456823349, Test Loss: 4.180291593074799, LR: 5e-06, Elapsed Time: 3388.70 seconds Step 148800/150000, Loss: 4.13461437702179, Test Loss: 4.1803136467933655, LR: 5e-06, Elapsed Time: 3390.96 seconds Step 148900/150000, Loss: 4.143134479522705, Test Loss: 4.180439054965973, LR: 5e-06, Elapsed Time: 3393.21 seconds Step 149000/150000, Loss: 4.143254282474518, Test Loss: 4.180271029472351, LR: 5e-06, Elapsed Time: 3395.46 seconds Step 149100/150000, Loss: 4.1517451691627505, Test Loss: 4.180273711681366, LR: 5e-06, Elapsed Time: 3397.71 seconds Step 149200/150000, Loss: 4.146310234069825, Test Loss: 4.180260002613068, LR: 5e-06, Elapsed Time: 3399.96 seconds Step 149300/150000, Loss: 4.1473937559127805, Test Loss: 4.180250108242035, LR: 5e-06, Elapsed Time: 3402.22 seconds Step 149400/150000, Loss: 
4.139902973175049, Test Loss: 4.180222570896149, LR: 5e-06, Elapsed Time: 3404.47 seconds Step 149500/150000, Loss: 4.14453691482544, Test Loss: 4.180257201194763, LR: 5e-06, Elapsed Time: 3406.73 seconds Step 149600/150000, Loss: 4.13150342464447, Test Loss: 4.1803149580955505, LR: 5e-06, Elapsed Time: 3408.98 seconds Step 149700/150000, Loss: 4.142852101325989, Test Loss: 4.180355608463287, LR: 5e-06, Elapsed Time: 3411.24 seconds Step 149800/150000, Loss: 4.155404534339905, Test Loss: 4.180193781852722, LR: 5e-06, Elapsed Time: 3413.50 seconds Step 149900/150000, Loss: 4.14969812631607, Test Loss: 4.180100202560425, LR: 5e-06, Elapsed Time: 3415.75 seconds Step 150000/150000, Loss: 4.143090476989746, Test Loss: 4.1801371574401855, LR: 5e-06, Elapsed Time: 3418.00 seconds Saving model checkpoint at step 150000
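To put this number in perspective, the loss we are printing is the average cross entropy (negative log likelihood) of the correct next token, so a test loss of roughly $4.18$ corresponds to a perplexity of $e^{4.18} \approx 65$. In other words, at each step the model is on average about as uncertain as if it were choosing uniformly among ~65 tokens.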
if use_existing_model:
    print("Existing model used, no loss curves shown.")
    plt.imshow(plt.imread("./loss_curve.png"))
else:
    plt.figure(figsize=(10, 6))
    plt.plot(losses, label="Train Loss", color='blue')
    plt.plot(test_losses, label="Test Loss", color='red')
    plt.xlabel('Checkpoint')
    plt.ylabel('Loss')
    plt.title('Training and Test Loss Over Time')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()

if not use_existing_model:
    torch.save(model, "./pretrain_final.pth")

Now that we have pretrained the model, we can run a few inference examples to see what kinds of outputs it produces. The model outputs legible English, and most of its word choices make sense; however, its small size keeps it from being as robust as larger models. It is still good enough to show the "sparks" of language understanding.
Since the pretraining dataset consisted of news articles, I've started each prompt with a phrase that could plausibly appear in the news. If you rerun the cell below, you will get a different output each time; this is because the next token is sampled at random from the model's predicted distribution.
def inference(prompt, torch_model, max_new_tokens):
    torch_model.eval()
    with torch.no_grad():
        tokens = hf_tokenizer.encode(prompt)
        for _ in range(max_new_tokens):
            num_tokens = len(tokens)
            # Pad the sequence out to the full context window with eos tokens
            tokens_padded = tokens + [hf_tokenizer.eos_token_id] * (config.seq_len - num_tokens)
            tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
            logits = torch_model(tokens_padded)
            # Softmax over the logits at the last real token gives the next-token distribution
            probabilities = torch.softmax(logits[0, num_tokens - 1, :], dim=-1)
            # Sample the next token from that distribution and append it to the sequence
            predicted_token = torch.multinomial(probabilities, 1).item()
            tokens.append(predicted_token)
    return hf_tokenizer.decode(tokens)

print("Predicted:", inference("The president signed a bill to pass", model, max_new_tokens=20))
print("Predicted:", inference("There was a large division in", model, max_new_tokens=20))
print("Predicted:", inference("Reports are showing that", model, max_new_tokens=20))

Predicted: The president signed a bill to pass legislation that would allow for tax breaks if enacted into the law' law as it does not allow the
Predicted: There was a large division in his office, probably for up to 40,000 years ago and has been hospitalized with more than 30
Predicted: Reports are showing that cherry-slipped bazers had effectively volunteered. ‘I think we object about the fact
To make the model more usable, we can take the pretrained model and put it through a process called supervised fine tuning. This process uses high quality, human-labeled text datasets to teach the model to respond the way we want.
We can use the GammaCorpus Fact-QA dataset from Hugging Face for this. It consists of short question-answer pairs, which suits our use case since we have a small context window of 128 tokens.
Supervised fine tuning is also where we can introduce "tags" and other special tokens that help the model distinguish different roles in the text. For our dataset, we will wrap each example in a "question" tag and an "answer" tag. We add these tags when we create the dataset, and again at inference time when a user submits a query. We also append eos tokens to terminate and pad the examples that do not fill the full context window.
After fine tuning on this dataset, we should ideally have an LLM that you can ask a question and get an answer from.
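To make the format concrete, here is a minimal sketch (using a made-up question and answer, and the hf_tokenizer defined earlier) of what a single fine tuning example looks like after tagging, padding, and shifting; the dataset builder below does exactly this at scale:

chunk_size = 128  # matches CHUNK_SIZE in the cell below
question = "What is the capital of France?"  # made-up example pair
answer = "Paris"

text = "<Question>" + question + "</Question>" + "<Answer>" + answer + "</Answer>"
tokens = hf_tokenizer(text, truncation=False, padding=False)["input_ids"]

# Pad the short example out to the full context window with eos tokens
tokens = tokens + [hf_tokenizer.eos_token_id] * (chunk_size - len(tokens))

input_tokens = tokens                                     # what the model reads
target_tokens = tokens[1:] + [hf_tokenizer.eos_token_id]  # the same sequence shifted left by one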
# Load dataset in streaming mode
sft_ds = load_dataset("rubenroy/GammaCorpus-Fact-QA-450k", split="train", streaming=True)

def check_sft_dataset_exists():
    try:
        # Attempt to load the tokenized train/test splits from the local parquet files
        load_dataset("parquet", data_files="fact_qa_train.parquet", split="train")
        load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")
        return True
    except FileNotFoundError:
        return False

if not check_sft_dataset_exists():
    print("Tokenized supervised fine tuning dataset does not exist locally... Generating and saving to disk.")

    def tokenize_and_chunk(dataset, tokenizer, chunk_size=512, rows=1000):
        """
        Tokenizes the dataset into fixed-length `chunk_size`-token examples.
        The 'target' sequence is the input shifted left by 1 token.
        Stops after generating `rows` tokenized examples.
        """
        row_count = 0
        for example in dataset:
            question_plus_answer = "<Question>" + example["question"] + "</Question>" + "<Answer>" + example["answer"] + "</Answer>"
            input_tokens = tokenizer(question_plus_answer, truncation=False, padding=False)['input_ids']
            if row_count >= rows:
                return
            if len(input_tokens) >= chunk_size:
                # Skip examples that are too long for the context window
                continue
            else:
                # Pad short examples out to the full context window with eos tokens
                input_tokens = input_tokens + [tokenizer.eos_token_id] * (chunk_size - len(input_tokens))
                target_tokens = input_tokens[1:] + [tokenizer.eos_token_id]  # Shifted by 1 token
                yield {
                    "input": input_tokens,
                    "target": target_tokens
                }
                row_count += 1

    # Set the max number of rows for training and testing
    TRAIN_ROWS = 440000  # Adjust as needed
    TEST_ROWS = 500  # Adjust as needed
    CHUNK_SIZE = 128

    # Convert generator to a Hugging Face Dataset
    tokenized_sft_dataset = Dataset.from_generator(lambda: tokenize_and_chunk(sft_ds, hf_tokenizer, chunk_size=CHUNK_SIZE, rows=TRAIN_ROWS + TEST_ROWS))

    # Split the dataset into `train` and `test`
    sft_dataset_splits = tokenized_sft_dataset.train_test_split(train_size=TRAIN_ROWS, test_size=TEST_ROWS, seed=42)

    # Save to disk
    sft_dataset_splits["train"].to_parquet("fact_qa_train.parquet")
    sft_dataset_splits["test"].to_parquet("fact_qa_test.parquet")
    print(f"✅ Saved {TRAIN_ROWS} train rows and {TEST_ROWS} test rows for supervised fine tuning.")
else:
    print("SFT Tokenized dataset already exists locally.")
Tokenized supervised fine tuning dataset does not exist locally... Generating and saving to disk.
✅ Saved 440000 train rows and 500 test rows for supervised fine tuning.
A training loop very similar to the one we used for pretraining can be used for supervised fine tuning.
# Example config:
batch_size = 64
sequence_len = 128
num_steps = 50000
accumulation_steps = 100  # How often we log the averaged train loss and run evaluation

# Reload the train and test datasets
train_ds = load_dataset("parquet", data_files="fact_qa_train.parquet", split="train")
test_ds = load_dataset("parquet", data_files="fact_qa_test.parquet", split="train")

# Convert dataset to PyTorch format
train_ds.set_format("torch", columns=["input", "target"])
test_ds.set_format("torch", columns=["input", "target"])

# Create DataLoaders for training and testing
train_dataloader = cycle(DataLoader(train_ds, batch_size=batch_size, shuffle=False))
test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

use_existing_model = os.path.exists("./sft_final.pth")

# Check if a fine tuned model already exists
if use_existing_model:
    model = torch.load("./sft_final.pth", weights_only=False)
    print("Loaded fine tuned model from ./sft_final.pth, skipping training loop.")
else:
    # For SFT we start with the pretrained model
    model = torch.load("./pretrain_final.pth", weights_only=False)

    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

    # Scheduler that cuts the learning rate when the test loss plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, patience=10, min_lr=5e-6, threshold=1e-4)

    # Training loop
    losses = []
    test_losses = []
    accumulator = 0
    accumulator_loss = 0
    for i in range(num_steps):
        model.train()
        example = next(train_dataloader)
        train_input = example["input"].to(device)
        train_target = example["target"].to(device)
        logits = model(train_input)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), train_target.view(-1))
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # Update weights
        optimizer.step()
        optimizer.zero_grad()
        accumulator += 1
        accumulator_loss += loss.item()
        if accumulator >= accumulation_steps:
            losses.append(accumulator_loss / accumulation_steps)
            accumulator = 0
            accumulator_loss = 0
            # Evaluate on the held-out test split
            model.eval()
            test_loss = 0
            test_accumulator = 0
            with torch.no_grad():
                for test_example in test_dataloader:
                    test_input = test_example["input"].to(device)
                    test_target = test_example["target"].to(device)
                    test_logits = model(test_input)
                    test_loss += F.cross_entropy(test_logits.view(-1, test_logits.size(-1)), test_target.view(-1)).item()
                    test_accumulator += 1
            test_losses.append(test_loss / test_accumulator)
            print(f"Step {i+1}/{num_steps}, Loss: {losses[-1]}, Test Loss: {test_losses[-1]}")
            test_dataloader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
            scheduler.step(test_losses[-1])
        if (i + 1) % 50000 == 0:
            # Save a checkpoint every 50000 steps
            torch.save(model.state_dict(), f"./sft_model_checkpoint_{i}.pt")
Step 100/50000, Loss: 1.7894065862894057, Test Loss: 0.6899487301707268
Step 200/50000, Loss: 0.670128984451294, Test Loss: 0.6649056524038315
Step 300/50000, Loss: 0.6577387464046478, Test Loss: 0.655867725610733
... (log truncated: both losses fall steadily as fine tuning progresses) ...
Step 24000/50000, Loss: 0.4716689053177834, Test Loss: 0.5022483840584755
Step 24100/50000, Loss: 0.46665910333395005, Test Loss:
0.5023765712976456 Step 24200/50000, Loss: 0.4694711676239967, Test Loss: 0.5024354085326195 Step 24300/50000, Loss: 0.4672688579559326, Test Loss: 0.5024163909256458 Step 24400/50000, Loss: 0.4629089653491974, Test Loss: 0.5023695789277554 Step 24500/50000, Loss: 0.46146546363830565, Test Loss: 0.5023573376238346 Step 24600/50000, Loss: 0.4693963986635208, Test Loss: 0.5020553283393383 Step 24700/50000, Loss: 0.46189400404691694, Test Loss: 0.5021664276719093 Step 24800/50000, Loss: 0.46608135670423506, Test Loss: 0.5022244416177273 Step 24900/50000, Loss: 0.4589571440219879, Test Loss: 0.5024607218801975 Step 25000/50000, Loss: 0.4629048019647598, Test Loss: 0.5021452978253365 Step 25100/50000, Loss: 0.4642249670624733, Test Loss: 0.5021775141358376 Step 25200/50000, Loss: 0.4602755627036095, Test Loss: 0.5020267590880394 Step 25300/50000, Loss: 0.4630341744422913, Test Loss: 0.5020380951464176 Step 25400/50000, Loss: 0.46108362942934034, Test Loss: 0.5019224025309086 Step 25500/50000, Loss: 0.46431060671806335, Test Loss: 0.5020202063024044 Step 25600/50000, Loss: 0.46064938127994537, Test Loss: 0.5021353252232075 Step 25700/50000, Loss: 0.4596227452158928, Test Loss: 0.5019978806376457 Step 25800/50000, Loss: 0.4627624672651291, Test Loss: 0.5022884383797646 Step 25900/50000, Loss: 0.45465937197208406, Test Loss: 0.5022797510027885 Step 26000/50000, Loss: 0.4602247479557991, Test Loss: 0.5023479834198952 Step 26100/50000, Loss: 0.4547144192457199, Test Loss: 0.5023751519620419 Step 26200/50000, Loss: 0.45940943896770475, Test Loss: 0.5021818913519382 Step 26300/50000, Loss: 0.45368317008018494, Test Loss: 0.5023400746285915 Step 26400/50000, Loss: 0.45632135450839995, Test Loss: 0.5024513900279999 Step 26500/50000, Loss: 0.4554154166579247, Test Loss: 0.5024019181728363 Step 26600/50000, Loss: 0.4558100646734238, Test Loss: 0.5024100318551064 Step 26700/50000, Loss: 0.4524954304099083, Test Loss: 0.5024291835725307 Step 26800/50000, Loss: 0.4520637735724449, Test Loss: 0.5024739019572735 Step 26900/50000, Loss: 0.45630247265100476, Test Loss: 0.5023978725075722 Step 27000/50000, Loss: 0.45039484471082686, Test Loss: 0.502426128834486 Step 27100/50000, Loss: 0.45053510785102846, Test Loss: 0.5023452937602997 Step 27200/50000, Loss: 0.451779320538044, Test Loss: 0.502366878092289 Step 27300/50000, Loss: 0.45302672177553177, Test Loss: 0.5023815371096134 Step 27400/50000, Loss: 0.4481547796726227, Test Loss: 0.5024418719112873 Step 27500/50000, Loss: 0.44427485674619677, Test Loss: 0.5025362223386765 Step 27600/50000, Loss: 0.4456496116518974, Test Loss: 0.5025747939944267 Step 27700/50000, Loss: 0.4400458693504333, Test Loss: 0.5025870725512505 Step 27800/50000, Loss: 0.44037798076868057, Test Loss: 0.5026613883674145 Step 27900/50000, Loss: 0.44103302150964735, Test Loss: 0.5027294494211674 Step 28000/50000, Loss: 0.4356226268410683, Test Loss: 0.5029292479157448 Step 28100/50000, Loss: 0.4362826246023178, Test Loss: 0.5030244551599026 Step 28200/50000, Loss: 0.4531803122162819, Test Loss: 0.5026777647435665 Step 28300/50000, Loss: 0.4652056521177292, Test Loss: 0.5024219490587711 Step 28400/50000, Loss: 0.4623024901747704, Test Loss: 0.5021953955292702 Step 28500/50000, Loss: 0.4698593419790268, Test Loss: 0.5019765570759773 Step 28600/50000, Loss: 0.4713398265838623, Test Loss: 0.5018728114664555 Step 28700/50000, Loss: 0.46969524592161177, Test Loss: 0.5017674267292023 Step 28800/50000, Loss: 0.47093317002058027, Test Loss: 0.5017237216234207 Step 28900/50000, Loss: 
0.471667223572731, Test Loss: 0.5016866102814674 Step 29000/50000, Loss: 0.46903841853141787, Test Loss: 0.5016762614250183 Step 29100/50000, Loss: 0.4629881393909454, Test Loss: 0.5016997158527374 Step 29200/50000, Loss: 0.46923464447259905, Test Loss: 0.5016297623515129 Step 29300/50000, Loss: 0.46737633228302, Test Loss: 0.5016226582229137 Step 29400/50000, Loss: 0.4676260563731194, Test Loss: 0.5015422664582729 Step 29500/50000, Loss: 0.4677749601006508, Test Loss: 0.5015387944877148 Step 29600/50000, Loss: 0.4689545515179634, Test Loss: 0.5015231557190418 Step 29700/50000, Loss: 0.4665097558498383, Test Loss: 0.5014706775546074 Step 29800/50000, Loss: 0.46871524572372436, Test Loss: 0.501375675201416 Step 29900/50000, Loss: 0.46577505141496656, Test Loss: 0.5013439916074276 Step 30000/50000, Loss: 0.4621674692630768, Test Loss: 0.5013883039355278 Step 30100/50000, Loss: 0.46299754798412324, Test Loss: 0.5013499930500984 Step 30200/50000, Loss: 0.46211815655231475, Test Loss: 0.5013172663748264 Step 30300/50000, Loss: 0.4665739831328392, Test Loss: 0.5013313479721546 Step 30400/50000, Loss: 0.4669671383500099, Test Loss: 0.5013204962015152 Step 30500/50000, Loss: 0.4651323547959328, Test Loss: 0.5013248324394226 Step 30600/50000, Loss: 0.46224466621875765, Test Loss: 0.5013634636998177 Step 30700/50000, Loss: 0.46512631088495254, Test Loss: 0.5013839080929756 Step 30800/50000, Loss: 0.46646908432245254, Test Loss: 0.5013682022690773 Step 30900/50000, Loss: 0.4650062966346741, Test Loss: 0.5013505108654499 Step 31000/50000, Loss: 0.4644706362485886, Test Loss: 0.501357588917017 Step 31100/50000, Loss: 0.4648225772380829, Test Loss: 0.501377671957016 Step 31200/50000, Loss: 0.45945363998413086, Test Loss: 0.5014038048684597 Step 31300/50000, Loss: 0.45951696336269376, Test Loss: 0.5014193244278431 Step 31400/50000, Loss: 0.4602182424068451, Test Loss: 0.5014018975198269 Step 31500/50000, Loss: 0.4645255380868912, Test Loss: 0.5013576708734035 Step 31600/50000, Loss: 0.4584848949313164, Test Loss: 0.5014085695147514 Step 31700/50000, Loss: 0.459740195274353, Test Loss: 0.5014818608760834 Step 31800/50000, Loss: 0.4567695745825768, Test Loss: 0.5015067383646965 Step 31900/50000, Loss: 0.4580450391769409, Test Loss: 0.5014513395726681 Step 32000/50000, Loss: 0.458971663415432, Test Loss: 0.5014667585492134 Step 32100/50000, Loss: 0.4564270082116127, Test Loss: 0.5014294870197773 Step 32200/50000, Loss: 0.46097953617572784, Test Loss: 0.5014451667666435 Step 32300/50000, Loss: 0.4567831841111183, Test Loss: 0.5014963112771511 Step 32400/50000, Loss: 0.45923821568489076, Test Loss: 0.5014836527407169 Step 32500/50000, Loss: 0.45774698346853254, Test Loss: 0.5014729984104633 Step 32600/50000, Loss: 0.45524754852056504, Test Loss: 0.5014785639941692 Step 32700/50000, Loss: 0.45761937320232393, Test Loss: 0.5015468373894691 Step 32800/50000, Loss: 0.45148681819438935, Test Loss: 0.5016140043735504 Step 32900/50000, Loss: 0.45721014618873596, Test Loss: 0.5016286261379719 Step 33000/50000, Loss: 0.45024279206991197, Test Loss: 0.5016967579722404 Step 33100/50000, Loss: 0.45531607180833816, Test Loss: 0.5016856268048286 Step 33200/50000, Loss: 0.44876173973083494, Test Loss: 0.5017441064119339 Step 33300/50000, Loss: 0.45277988761663435, Test Loss: 0.5017973519861698 Step 33400/50000, Loss: 0.4526422739028931, Test Loss: 0.5018364116549492 Step 33500/50000, Loss: 0.45178504049777984, Test Loss: 0.5018700696527958 Step 33600/50000, Loss: 0.4524467149376869, Test Loss: 0.5018718093633652 Step 
33700/50000, Loss: 0.4544000518321991, Test Loss: 0.5018896907567978 Step 33800/50000, Loss: 0.4542145425081253, Test Loss: 0.5018197856843472 Step 33900/50000, Loss: 0.4472826811671257, Test Loss: 0.5018804147839546 Step 34000/50000, Loss: 0.4509144797921181, Test Loss: 0.501849852502346 Step 34100/50000, Loss: 0.4498064476251602, Test Loss: 0.5018765665590763 Step 34200/50000, Loss: 0.45064257711172107, Test Loss: 0.5018821284174919 Step 34300/50000, Loss: 0.44785331904888154, Test Loss: 0.5019666813313961 Step 34400/50000, Loss: 0.44483387500047683, Test Loss: 0.5020587854087353 Step 34500/50000, Loss: 0.44307459115982056, Test Loss: 0.5021104216575623 Step 34600/50000, Loss: 0.44084274530410766, Test Loss: 0.5021173171699047 Step 34700/50000, Loss: 0.43833828091621396, Test Loss: 0.5022791922092438 Step 34800/50000, Loss: 0.4400048726797104, Test Loss: 0.5022997856140137 Step 34900/50000, Loss: 0.4360825061798096, Test Loss: 0.502533707767725 Step 35000/50000, Loss: 0.4363590368628502, Test Loss: 0.5026167184114456 Step 35100/50000, Loss: 0.4613098162412643, Test Loss: 0.5021704509854317 Step 35200/50000, Loss: 0.4619985839724541, Test Loss: 0.5019823834300041 Step 35300/50000, Loss: 0.46362817764282227, Test Loss: 0.501716960221529 Step 35400/50000, Loss: 0.47281751930713656, Test Loss: 0.5015313476324081 Step 35500/50000, Loss: 0.46824741512537005, Test Loss: 0.5014343932271004 Step 35600/50000, Loss: 0.46984292566776276, Test Loss: 0.5013517029583454 Step 35700/50000, Loss: 0.4689937967061997, Test Loss: 0.5013322308659554 Step 35800/50000, Loss: 0.4711910900473595, Test Loss: 0.5013112761080265 Step 35900/50000, Loss: 0.4655409336090088, Test Loss: 0.5012953095138073 Step 36000/50000, Loss: 0.4612109282612801, Test Loss: 0.5013049617409706 Step 36100/50000, Loss: 0.46956610172986984, Test Loss: 0.5012511648237705 Step 36200/50000, Loss: 0.4674121195077896, Test Loss: 0.5012303367257118 Step 36300/50000, Loss: 0.4668622562289238, Test Loss: 0.5011749118566513 Step 36400/50000, Loss: 0.4656849908828735, Test Loss: 0.5011928081512451 Step 36500/50000, Loss: 0.46981538653373717, Test Loss: 0.5011161416769028 Step 36600/50000, Loss: 0.4653474897146225, Test Loss: 0.501069936901331 Step 36700/50000, Loss: 0.46546988666057587, Test Loss: 0.501018974930048 Step 36800/50000, Loss: 0.4640503132343292, Test Loss: 0.500993836671114 Step 36900/50000, Loss: 0.46271914124488833, Test Loss: 0.5010315030813217 Step 37000/50000, Loss: 0.4625481230020523, Test Loss: 0.5009903497993946 Step 37100/50000, Loss: 0.46249220192432405, Test Loss: 0.5009798556566238 Step 37200/50000, Loss: 0.46419251829385755, Test Loss: 0.5010003373026848 Step 37300/50000, Loss: 0.46661396145820616, Test Loss: 0.5009856186807156 Step 37400/50000, Loss: 0.4641845437884331, Test Loss: 0.5010005459189415 Step 37500/50000, Loss: 0.45898341059684755, Test Loss: 0.501035176217556 Step 37600/50000, Loss: 0.46821809977293016, Test Loss: 0.5010434314608574 Step 37700/50000, Loss: 0.4645710316300392, Test Loss: 0.5010265596210957 Step 37800/50000, Loss: 0.4640098667144775, Test Loss: 0.5010393038392067 Step 37900/50000, Loss: 0.46412052005529403, Test Loss: 0.5010656118392944 Step 38000/50000, Loss: 0.46380079209804537, Test Loss: 0.5010442435741425 Step 38100/50000, Loss: 0.45950564950704575, Test Loss: 0.5010580159723759 Step 38200/50000, Loss: 0.4583196929097176, Test Loss: 0.5010528229176998 Step 38300/50000, Loss: 0.4578334194421768, Test Loss: 0.5010626949369907 Step 38400/50000, Loss: 0.4641471928358078, Test Loss: 
0.5010474137961864 Step 38500/50000, Loss: 0.45832279920578, Test Loss: 0.5011086650192738 Step 38600/50000, Loss: 0.4596054396033287, Test Loss: 0.5011686645448208 Step 38700/50000, Loss: 0.4553371977806091, Test Loss: 0.5011721514165401 Step 38800/50000, Loss: 0.4574957764148712, Test Loss: 0.5011417493224144 Step 38900/50000, Loss: 0.457391936480999, Test Loss: 0.5011355429887772 Step 39000/50000, Loss: 0.4574935802817345, Test Loss: 0.5010716021060944 Step 39100/50000, Loss: 0.4593946158885956, Test Loss: 0.5011395439505577 Step 39200/50000, Loss: 0.4579893064498901, Test Loss: 0.5011791661381721 Step 39300/50000, Loss: 0.45883000135421753, Test Loss: 0.5011468753218651 Step 39400/50000, Loss: 0.45464896500110624, Test Loss: 0.5011775493621826 Step 39500/50000, Loss: 0.4580179151892662, Test Loss: 0.5011567324399948 Step 39600/50000, Loss: 0.45292833030223845, Test Loss: 0.5012234374880791 Step 39700/50000, Loss: 0.4537820702791214, Test Loss: 0.5013137497007847 Step 39800/50000, Loss: 0.45478583812713624, Test Loss: 0.5013053864240646 Step 39900/50000, Loss: 0.4497983455657959, Test Loss: 0.5013463273644447 Step 40000/50000, Loss: 0.4540665075182915, Test Loss: 0.5013807415962219 Step 40100/50000, Loss: 0.45197425842285155, Test Loss: 0.5013848207890987 Step 40200/50000, Loss: 0.4500792470574379, Test Loss: 0.5014711953699589 Step 40300/50000, Loss: 0.453069281578064, Test Loss: 0.5015005990862846 Step 40400/50000, Loss: 0.4506500625610352, Test Loss: 0.501548171043396 Step 40500/50000, Loss: 0.45183879375457764, Test Loss: 0.5015650875866413 Step 40600/50000, Loss: 0.4543327933549881, Test Loss: 0.5015492886304855 Step 40700/50000, Loss: 0.4508472245931625, Test Loss: 0.5015063136816025 Step 40800/50000, Loss: 0.45093833416700363, Test Loss: 0.5015206262469292 Step 40900/50000, Loss: 0.44664413928985597, Test Loss: 0.5015188828110695 Step 41000/50000, Loss: 0.451400865316391, Test Loss: 0.5015908963978291 Step 41100/50000, Loss: 0.44933731645345687, Test Loss: 0.5015690959990025 Step 41200/50000, Loss: 0.4465946346521378, Test Loss: 0.50167266279459 Step 41300/50000, Loss: 0.4445001712441444, Test Loss: 0.5017509907484055 Step 41400/50000, Loss: 0.4418393585085869, Test Loss: 0.5018127113580704 Step 41500/50000, Loss: 0.44124255418777464, Test Loss: 0.5018390603363514 Step 41600/50000, Loss: 0.4386308166384697, Test Loss: 0.5019795894622803 Step 41700/50000, Loss: 0.44028086751699447, Test Loss: 0.5020023062825203 Step 41800/50000, Loss: 0.4352584308385849, Test Loss: 0.5022229515016079 Step 41900/50000, Loss: 0.4381945076584816, Test Loss: 0.5022110231220722 Step 42000/50000, Loss: 0.46495745092630386, Test Loss: 0.5017723441123962 Step 42100/50000, Loss: 0.46158114582300186, Test Loss: 0.5016018971800804 Step 42200/50000, Loss: 0.4646707597374916, Test Loss: 0.5013327859342098 Step 42300/50000, Loss: 0.4723111265897751, Test Loss: 0.5011717230081558 Step 42400/50000, Loss: 0.46732883155345917, Test Loss: 0.5010780096054077 Step 42500/50000, Loss: 0.4692428630590439, Test Loss: 0.5010198876261711 Step 42600/50000, Loss: 0.4681865236163139, Test Loss: 0.5010362900793552 Step 42700/50000, Loss: 0.4688673847913742, Test Loss: 0.5010101981461048 Step 42800/50000, Loss: 0.46303324073553087, Test Loss: 0.5010134018957615 Step 42900/50000, Loss: 0.46190293282270434, Test Loss: 0.500985860824585 Step 43000/50000, Loss: 0.4691228488087654, Test Loss: 0.500946830958128 Step 43100/50000, Loss: 0.4677538934350014, Test Loss: 0.5009460374712944 Step 43200/50000, Loss: 0.46625195145606996, Test 
Loss: 0.5008740350604057 Step 43300/50000, Loss: 0.4640511643886566, Test Loss: 0.5009244792163372 Step 43400/50000, Loss: 0.4665636906027794, Test Loss: 0.5008159950375557 Step 43500/50000, Loss: 0.4671597841382027, Test Loss: 0.5007410608232021 Step 43600/50000, Loss: 0.4641227728128433, Test Loss: 0.5007301419973373 Step 43700/50000, Loss: 0.4633761635422707, Test Loss: 0.5007203817367554 Step 43800/50000, Loss: 0.46214351028203965, Test Loss: 0.5007463805377483 Step 43900/50000, Loss: 0.46062709659337997, Test Loss: 0.5006855353713036 Step 44000/50000, Loss: 0.4634872642159462, Test Loss: 0.5007058121263981 Step 44100/50000, Loss: 0.46471895784139633, Test Loss: 0.5007161721587181 Step 44200/50000, Loss: 0.4659980982542038, Test Loss: 0.5006871521472931 Step 44300/50000, Loss: 0.46030958235263825, Test Loss: 0.5007446706295013 Step 44400/50000, Loss: 0.4600467967987061, Test Loss: 0.5007464215159416 Step 44500/50000, Loss: 0.4673938220739365, Test Loss: 0.5007539130747318 Step 44600/50000, Loss: 0.46498212337493894, Test Loss: 0.5007070824503899 Step 44700/50000, Loss: 0.46189157575368883, Test Loss: 0.5007654540240765 Step 44800/50000, Loss: 0.46281050354242326, Test Loss: 0.5007943734526634 Step 44900/50000, Loss: 0.4639808592200279, Test Loss: 0.5007562525570393 Step 45000/50000, Loss: 0.4576499903202057, Test Loss: 0.5007820129394531 Step 45100/50000, Loss: 0.4534341612458229, Test Loss: 0.5007734969258308 Step 45200/50000, Loss: 0.4638801547884941, Test Loss: 0.5007734447717667 Step 45300/50000, Loss: 0.4590744495391846, Test Loss: 0.5007826164364815 Step 45400/50000, Loss: 0.4599402379989624, Test Loss: 0.5008482784032822 Step 45500/50000, Loss: 0.45705058693885803, Test Loss: 0.5009209364652634 Step 45600/50000, Loss: 0.4559430235624313, Test Loss: 0.5008799694478512 Step 45700/50000, Loss: 0.45830257862806323, Test Loss: 0.500896867364645 Step 45800/50000, Loss: 0.45596627026796344, Test Loss: 0.5008640959858894 Step 45900/50000, Loss: 0.45789127141237257, Test Loss: 0.5007930248975754 Step 46000/50000, Loss: 0.4571463504433632, Test Loss: 0.5008691176772118 Step 46100/50000, Loss: 0.4582295683026314, Test Loss: 0.5009022168815136 Step 46200/50000, Loss: 0.4579246589541435, Test Loss: 0.5008432939648628 Step 46300/50000, Loss: 0.4534035176038742, Test Loss: 0.50087920576334 Step 46400/50000, Loss: 0.45722023576498033, Test Loss: 0.5008842125535011 Step 46500/50000, Loss: 0.4517281222343445, Test Loss: 0.500956941395998 Step 46600/50000, Loss: 0.4535133907198906, Test Loss: 0.5010455660521984 Step 46700/50000, Loss: 0.4543594112992287, Test Loss: 0.5010179914534092 Step 46800/50000, Loss: 0.4521429255604744, Test Loss: 0.5010643899440765 Step 46900/50000, Loss: 0.4514783936738968, Test Loss: 0.5010826289653778 Step 47000/50000, Loss: 0.452489273250103, Test Loss: 0.5011294670403004 Step 47100/50000, Loss: 0.44850277453660964, Test Loss: 0.5011962167918682 Step 47200/50000, Loss: 0.45421678364276885, Test Loss: 0.5012331902980804 Step 47300/50000, Loss: 0.45029215425252916, Test Loss: 0.5012377202510834 Step 47400/50000, Loss: 0.4513078424334526, Test Loss: 0.5012977793812752 Step 47500/50000, Loss: 0.45403604418039323, Test Loss: 0.5012771934270859 Step 47600/50000, Loss: 0.4494051992893219, Test Loss: 0.5012643933296204 Step 47700/50000, Loss: 0.44979759931564334, Test Loss: 0.5012452751398087 Step 47800/50000, Loss: 0.4478905948996544, Test Loss: 0.501251682639122 Step 47900/50000, Loss: 0.45096522599458694, Test Loss: 0.5012961141765118 Step 48000/50000, Loss: 
0.4485600805282593, Test Loss: 0.5013213120400906 Step 48100/50000, Loss: 0.443496415913105, Test Loss: 0.5014169104397297 Step 48200/50000, Loss: 0.4468724298477173, Test Loss: 0.5014543868601322 Step 48300/50000, Loss: 0.43916097432374956, Test Loss: 0.501509003341198 Step 48400/50000, Loss: 0.4395106253027916, Test Loss: 0.5015630647540092 Step 48500/50000, Loss: 0.44071861177682875, Test Loss: 0.5016614384949207 Step 48600/50000, Loss: 0.4379118290543556, Test Loss: 0.5017481371760368 Step 48700/50000, Loss: 0.4357627189159393, Test Loss: 0.5018959939479828 Step 48800/50000, Loss: 0.44376066118478774, Test Loss: 0.501755379140377 Step 48900/50000, Loss: 0.4674203127622604, Test Loss: 0.501388244330883 Step 49000/50000, Loss: 0.4611475524306297, Test Loss: 0.5012223459780216 Step 49100/50000, Loss: 0.4634071630239487, Test Loss: 0.500988133251667 Step 49200/50000, Loss: 0.4706606161594391, Test Loss: 0.5008767060935497 Step 49300/50000, Loss: 0.4656855249404907, Test Loss: 0.500786330550909 Step 49400/50000, Loss: 0.4685150933265686, Test Loss: 0.5007378049194813 Step 49500/50000, Loss: 0.46967413753271103, Test Loss: 0.5007382407784462 Step 49600/50000, Loss: 0.4655856826901436, Test Loss: 0.5007290728390217 Step 49700/50000, Loss: 0.460963761806488, Test Loss: 0.5007485263049603 Step 49800/50000, Loss: 0.468090540766716, Test Loss: 0.5006921850144863 Step 49900/50000, Loss: 0.46244903177022934, Test Loss: 0.5006965585052967 Step 50000/50000, Loss: 0.46912035673856733, Test Loss: 0.500646710395813
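Two trends stand out in this log: the training loss keeps falling over all 50,000 steps, while the test loss flattens out around 0.50, which suggests the model is beginning to overfit our small fine-tuning dataset. For context, here is a minimal sketch of how such periodic test-loss readings can be computed on held-out data; the helper and its arguments (test_loader, criterion) are assumptions for illustration, and the actual training loop appears earlier in the notebook.

@torch.no_grad()
def estimate_test_loss(model, test_loader, criterion, num_batches=10):
    model.eval()  # Disable dropout for a stable estimate
    total = 0.0
    for i, (x, y) in enumerate(test_loader):
        if i >= num_batches:
            break
        x, y = x.to(device), y.to(device)
        logits = model(x)  # (batch, seq_len, vocab_size)
        # Flatten so each token position counts as one classification example
        total += criterion(logits.view(-1, logits.size(-1)), y.view(-1)).item()
    model.train()  # Restore training mode
    return total / num_batches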
if use_existing_model:
    print("Existing model used, no loss curves shown.")
    plt.imshow(plt.imread("./sft_loss_curve.png"))
else:
    plt.figure(figsize=(10, 6))
    plt.plot(losses, label="Train Loss", color='blue')
    plt.plot(test_losses, label="Test Loss", color='red')
    plt.xlabel('Checkpoint')
    plt.ylabel('Loss')
    plt.title('Supervised Fine Tuning - Training and Test Loss Over Time')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend()
    plt.show()
if not use_existing_model:
    torch.save(model, "./sft_final.pth")

With the fine-tuned model, we can perform a more natural form of inference. Instead of formatting every prompt as raw next-token prediction, we can interact with the model in a Q&A style format.
We are using a very small model and a very small dataset compared to modern LLMs, so our model will not perform well on most questions. However, as the outputs below show, it produces responses that are at least related to the prompt and formatted correctly. It is very cool to see the LLM starting to come together! As we scale up the model, data, and so on, the responses will become more factual, realistic, and contextually accurate. At this point, the majority of the responses are hallucinations.
def sft_inference(prompt, torch_model, max_new_tokens):
    torch_model.eval()
    prompt = "<Question>" + prompt + "</Question>" + "<Answer>"  # Wrap the prompt in <Question> tags and start inference at <Answer>
    with torch.no_grad():
        tokens = hf_tokenizer.encode(prompt)  # Tokenize the prompt
        for _ in range(max_new_tokens):
            if tokens[-1] == hf_tokenizer.eos_token_id:  # Stop if we reach the end of the sequence
                break
            num_tokens = len(tokens)  # Number of real (non-padding) tokens so far
            tokens_padded = tokens + [hf_tokenizer.eos_token_id] * (config.seq_len - num_tokens)  # Pad the sequence to seq_len with eos tokens
            tokens_padded = torch.tensor(tokens_padded).unsqueeze(0).to(device)
            logits = torch_model(tokens_padded)  # Forward pass through the model
            probabilities = torch.softmax(logits[0, num_tokens - 1, :], dim=-1)  # Distribution over the next token
            predicted_token = torch.argmax(probabilities).item()  # Greedy decoding, change to sampling for more diversity
            tokens.append(predicted_token)
    # Strip the text to between the <Answer></Answer> tags
    full_answer = hf_tokenizer.decode(tokens)
    answer_start = full_answer.find("<Answer>") + len("<Answer>")
    answer_end = full_answer.find("</Answer>")
    return full_answer[answer_start:answer_end]

print("Predicted:", sft_inference("Who is the most powerful leader in the west?", model, max_new_tokens=20))
print("Predicted:", sft_inference("What color is the sun?", model, max_new_tokens=20))
print("Predicted:", sft_inference("What color is the ocean", model, max_new_tokens=20))
print("Predicted:", sft_inference("How many planets are in the solar system", model, max_new_tokens=20))
print("Predicted:", sft_inference("What three countries are in north america?", model, max_new_tokens=20))
print("Predicted:", sft_inference("How many eyes do humans have?", model, max_new_tokens=20))
Predicted: Theodore Roosevelt
Predicted: Yellow
Predicted: Red
Predicted: About 20,000 planets?
Predicted: United States and Canada
Predicted: Two eyes
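The sft_inference function above decodes greedily with torch.argmax, so the same prompt always produces the same answer. As the comment in the code suggests, swapping greedy decoding for sampling yields more diverse outputs. Below is a minimal sketch of temperature plus top-k sampling; the helper name and default values are our own choices for illustration, not part of the notebook.

def sample_next_token(logits, temperature=0.8, top_k=50):
    # Scale logits by temperature: lower values sharpen the distribution
    logits = logits / temperature
    # Keep only the top_k most likely tokens and renormalize
    top_values, top_indices = torch.topk(logits, top_k)
    probs = torch.softmax(top_values, dim=-1)
    # Sample one token from the truncated distribution
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()

Inside the generation loop, predicted_token = sample_next_token(logits[0, num_tokens - 1, :]) would replace the softmax and argmax lines.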