Generating New Drug SMILES Data Using an RNN-LSTM Model

The goal of this task is de novo generation of molecules using a Recurrent Neural Network. De novo simply means "from scratch": the model is trained to learn the patterns in SMILES strings so that the strings it generates correspond to valid molecules. SMILES (Simplified Molecular Input Line Entry System) is a string representation of a molecule derived from its structure and components, making it a computer-friendly way to represent molecules.
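
For example, ethanol is written as CCO and benzene as c1ccccc1. Once RDKit is installed (step 1 below), a quick way to check that a string is a valid SMILES is to round-trip it through a molecule object. A minimal sketch:

    from rdkit import Chem

    mol = Chem.MolFromSmiles("CCO")   # returns None if the SMILES is invalid
    print(Chem.MolToSmiles(mol))      # prints the canonical form: 'CCO'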

Steps:

  1. Install RDKit: RDKit is a cheminformatics toolkit that enables working with chemical structures and data. To install it, you can use the following command:

    !pip install rdkit-pypi
    
  2. Install DeepChem: DeepChem is a Python library that provides tools for deep learning in cheminformatics. You can install it using the following command:

    !pip install deepchem
    
  3. Import Libraries: In your Jupyter Notebook, import the required libraries for working with RDKit and DeepChem:

Import general packages to pre-process the SMILES data

First step: install the RDKit package to handle the chemical SMILES data.

  • Load the dataset, convert it to a DataFrame using rdkit.Chem.PandasTools, and do some analysis on it.
In [1]:
%%capture
!pip install -q condacolab
import condacolab
condacolab.install()
# already installed in my environment
!mamba install -c conda-forge rdkit
In [2]:
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
In [3]:
import deepchem_installer
deepchem_installer.install()
add /root/miniconda/lib/python3.10/site-packages to PYTHONPATH
all packages are already installed
In [4]:
%%capture
!pip install transformers
!pip install --pre deepchem
In [5]:
import deepchem
deepchem.__version__
No normalization for AvgIpc. Feature removed!
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/local/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. No module named 'haiku'
Out[5]:
'2.7.2.dev'
In [6]:
from rdkit.Chem import PandasTools
import pandas as pd
from rdkit.Chem.Draw import IPythonConsole
import os
from rdkit import Chem
from rdkit import RDConfig
import numpy as np
from rdkit.Chem import Draw , Descriptors
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from google.colab import drive , files
drive.mount('/drive')
Mounted at /drive

Load and process the dataset:

  • Read the CSV file with the delimiter set to "\t" to handle tab-separated values, naming the columns "smiles" and "labels" via the names parameter. The resulting DataFrame is stored in the variable data. Finally, set_index is called to set the "smiles" column as the index; since inplace=False, the original DataFrame is left unmodified and the re-indexed copy is simply displayed as the cell output.
In [7]:
data_training ='/drive/My Drive/smiles'
smifile = data_training + '/training.smi'
data = pd.read_csv(smifile, delimiter = "\t", names = ["smiles","labels"], index_col=False)
data.set_index("smiles",inplace=False)
Out[7]:
labels
smiles
CC(N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23)c4ccccc4 0
CN(C1CCN(CC1)c2cc(ncn2)C(F)(F)F)C(=O)C3=CN(CC=C)C(=O)c4[nH]ccc34 0
CN1C(=O)C=Cc2ccccc12 0
CC(C)N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23 0
CC(=O)c1cc(c2ccccc2S(=O)(=O)C)c3ccccn13 0
... ...
COc1cc2N(C)C(=O)C=C(C)c2cc1NS(=O)(=O)c3ccc(cc3)C#N 1
CN1C(=O)C=Cc2cc(NS(=O)(=O)c3ccc(cc3)C#N)ccc12 1
CN1C(=O)C=Cc2cc(NS(=O)(=O)c3ccc(cc3)C#N)ccc12 1
Cc1nnc2c3ccccc3c(nn12)c4ccc(N5CCOCC5)c(NS(=O)(=O)c6ccc(Cl)cc6)c4 1
CN1C(=O)C(=Cc2cc(NS(=O)(=O)c3ccc(cc3)C#N)ccc12)C 1

102 rows × 1 columns

In [8]:
data.head(10)
Out[8]:
smiles labels
0 CC(N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23... 0
1 CN(C1CCN(CC1)c2cc(ncn2)C(F)(F)F)C(=O)C3=CN(CC=... 0
2 CN1C(=O)C=Cc2ccccc12 0
3 CC(C)N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23 0
4 CC(=O)c1cc(c2ccccc2S(=O)(=O)C)c3ccccn13 0
5 CC(=O)c1cc(c2ccccc2S(=O)(=O)C)c3cc(Oc4ccccc4)c... 0
6 CNC(=O)N1CCc2c(C1)c(nn2C3CCOCC3)N4CCCc5cc(c6cn... 0
7 CC(=O)N1CCc2[nH]nc(Nc3ccccc3)c2C1 0
8 CC(=O)N1CCc2c(C1)c(Nc3ccc(cc3F)c4cnn(C)c4)nn2[... 0
9 CN(C1CCN(C)CC1)C(=O)C2=CN(C)C(=O)c3[nH]ccc23 0
In [9]:
data.describe()
Out[9]:
labels
count 102.000000
mean 0.656863
std 0.477101
min 0.000000
25% 0.000000
50% 1.000000
75% 1.000000
max 1.000000
In [10]:
data.iloc[0:5]
data.iloc[:-4]
Out[10]:
smiles labels
0 CC(N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23... 0
1 CN(C1CCN(CC1)c2cc(ncn2)C(F)(F)F)C(=O)C3=CN(CC=... 0
2 CN1C(=O)C=Cc2ccccc12 0
3 CC(C)N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23 0
4 CC(=O)c1cc(c2ccccc2S(=O)(=O)C)c3ccccn13 0
... ... ...
93 COc1ccc(cc1)S(=O)(=O)Nc2cc(ccc2N3CCN(C)CC3)c4n... 1
94 COc1cc(cc(C)c1CN(C)C)C2=CN(C)C(=O)C(=C2)C 1
95 C[C@@H]1CC(=O)Nc2cccc(c2N1)c3ccc4c(c3)c(nn4C)c... 1
96 C[C@@H]1CC(=O)Nc2cccc(c2N1)c3ccc4c(c3)c(nn4C)c... 1
97 COc1cc2N(C)C(=O)C=C(C)c2cc1NS(=O)(=O)c3ccc(cc3... 1

98 rows × 2 columns

In [11]:
data.smiles[0]
Out[11]:
'CC(N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23)c4ccccc4'

Visualize Chemical Components:

  • The function mol_with_atom_index(mol) takes a molecule (mol) as input and sets each atom's map number to its index in the molecule, so that the atom indices are shown when the molecule is drawn.
  • Next, IPythonConsole.drawOptions.addAtomIndices = True is set to enable displaying the atom indices in the molecule visualization.

    1. IPythonConsole.molSize is set to (300, 300), which determines the size of the molecule visualization.

    2. A molecule (mol_1) is created from the SMILES string in the third row of the data DataFrame (data.smiles.iloc[2]).

    3. The mol_with_atom_index function is called with mol_1 as an argument, and the result is stored in the variable mol_indexAtom.

    4. The last line of the code cell displays the molecule with atom indices.

In [12]:
def mol_with_atom_index(mol):
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(atom.GetIdx())
    return mol
In [13]:
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.molSize = 300, 300
mol_1=Chem.MolFromSmiles(data.smiles.iloc[2])
mol_indexAtom=mol_with_atom_index(mol_1)
mol_indexAtom
Out[13]:
[Image: molecule drawing of data.smiles.iloc[2] with atom indices]
In [14]:
IPythonConsole.molSize = 300,300

mol_indexAtom
Out[14]:
[Image: the same molecule redrawn at 300×300]
In [15]:
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.molSize = 300,300
# Parse each SMILES string in the dataset; the last parsed molecule is displayed
for smi in data.smiles:
    mol = Chem.MolFromSmiles(smi)
mol
Out[15]:
[Image: molecule drawing of the last SMILES in the dataset]
In [16]:
IPythonConsole.molSize = 400,400
mol_with_atom_index(mol)
Out[16]:
[Image: the same molecule with atom indices at 400×400]

Transformations and feature engineering:

In this code cell, we perform transformations and feature engineering on the chemical dataset. The dataset contains SMILES representations of chemical compounds.

  1. We create a charset containing unique characters from all SMILES strings along with special characters ("!" for start and "E" for end).

  2. We create dictionaries char_to_int and int_to_char to map characters to integers and vice versa.

  3. The variable embed is set to the length of the longest SMILES string in the dataset plus 5, which determines the embedding size for one-hot encoding.

  4. The function vectorize(smiles) is defined to convert SMILES strings to one-hot encoded vectors. The function iterates through each SMILES string and encodes it as a one-hot vector. It adds special characters ("!" at the start and "E" at the end) to indicate the beginning and end of the sequence.

  5. The vectorize function is applied to the training and test datasets (smiles_train and smiles_test) to create input (X_train and X_test) and output (Y_train and Y_test) arrays for training and testing purposes.

  6. Information about the first SMILES string in the training dataset is displayed using print(smiles_train.iloc[0]).

  7. A visualization of the one-hot encoded representation of the first SMILES string in the training dataset is displayed using plt.matshow(X_train[0].T).

In [17]:
from sklearn.model_selection import train_test_split
smiles_train, smiles_test = train_test_split(data["smiles"], random_state=42)
print(smiles_train.shape)
print(smiles_test.shape)
(76,)
(26,)
In [18]:
charset = set("".join(list(data.smiles))+"!E")
char_to_int = dict((c,i) for i,c in enumerate(charset))
int_to_char = dict((i,c) for i,c in enumerate(charset))
embed = max([len(smile) for smile in data.smiles]) + 5
print(str(charset))
print(len(charset), embed)
{'s', '!', '@', ']', 'C', 'S', '5', 'F', 'r', '1', ')', 'E', '=', 'H', '#', 'O', 'l', '4', '(', '3', '6', 'n', 'B', 'N', '[', 'c', '2'}
27 82
In [19]:
def vectorize(smiles):
    one_hot = np.zeros((smiles.shape[0], embed, len(charset)), dtype=np.int8)
    for i, smile in enumerate(smiles):
        # encode the start char
        one_hot[i, 0, char_to_int["!"]] = 1
        # encode the rest of the chars
        for j, c in enumerate(smile):
            one_hot[i, j+1, char_to_int[c]] = 1
        # encode the end char and pad the remainder with "E"
        one_hot[i, len(smile)+1:, char_to_int["E"]] = 1
    # return two arrays: one for input, the other for output (shifted by one)
    return one_hot[:, 0:-1, :], one_hot[:, 1:, :]
X_train, Y_train = vectorize(smiles_train.values)
X_test,Y_test = vectorize(smiles_test.values)
print(smiles_train.iloc[0])
plt.matshow(X_train[0].T)
#print X_train.shape
C[C@H]1C[C@@H](Nc2ccc(Cl)cc2)c3cc(ccc3N1C(=O)C)c4ccc(cc4)C(=O)O
Out[19]:
<matplotlib.image.AxesImage at 0x7db0adf3fa90>
[Image: plt.matshow of the one-hot matrix for the first training SMILES]
In [20]:
"".join([int_to_char[idx] for idx in np.argmax(X_train[0,:,:], axis=1)])
Out[20]:
'!C[C@H]1C[C@@H](Nc2ccc(Cl)cc2)c3cc(ccc3N1C(=O)C)c4ccc(cc4)C(=O)OEEEEEEEEEEEEEEEEE'
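
The decoded string confirms the round trip: a leading "!" start token, the original SMILES, then "E" padding. The target tensor is the same sequence shifted one step left, which is what teaches the decoder to predict the next character. A quick check (not in the original run):

    # Y_train is X_train shifted by one position: no leading "!",
    # one extra trailing "E"
    "".join([int_to_char[idx] for idx in np.argmax(Y_train[0,:,:], axis=1)])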
In [21]:
#Import Keras objects
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Concatenate
from keras import regularizers
input_shape = X_train.shape[1:]
output_dim = Y_train.shape[-1]
latent_dim = 64
lstm_dim = 64
In [22]:
unroll = False
encoder_inputs = Input(shape=input_shape)
encoder = LSTM(lstm_dim, return_state=True,
                unroll=unroll)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
states = Concatenate(axis=-1)([state_h, state_c])
neck = Dense(latent_dim, activation="relu")
neck_outputs = neck(states)
decode_h = Dense(lstm_dim, activation="relu")
decode_c = Dense(lstm_dim, activation="relu")
state_h_decoded =  decode_h(neck_outputs)
state_c_decoded =  decode_c(neck_outputs)
encoder_states = [state_h_decoded, state_c_decoded]
decoder_inputs = Input(shape=input_shape)
decoder_lstm = LSTM(lstm_dim,
                    return_sequences=True,
                    unroll=unroll
                   )
decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(output_dim, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
#Define the model: it takes the training vector at both inputs and predicts one character ahead of the input
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(model.summary())
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, 81, 27)]     0           []                               
                                                                                                  
 lstm (LSTM)                    [(None, 64),         23552       ['input_1[0][0]']                
                                 (None, 64),                                                      
                                 (None, 64)]                                                      
                                                                                                  
 concatenate (Concatenate)      (None, 128)          0           ['lstm[0][1]',                   
                                                                  'lstm[0][2]']                   
                                                                                                  
 dense (Dense)                  (None, 64)           8256        ['concatenate[0][0]']            
                                                                                                  
 input_2 (InputLayer)           [(None, 81, 27)]     0           []                               
                                                                                                  
 dense_1 (Dense)                (None, 64)           4160        ['dense[0][0]']                  
                                                                                                  
 dense_2 (Dense)                (None, 64)           4160        ['dense[0][0]']                  
                                                                                                  
 lstm_1 (LSTM)                  (None, 81, 64)       23552       ['input_2[0][0]',                
                                                                  'dense_1[0][0]',                
                                                                  'dense_2[0][0]']                
                                                                                                  
 dense_3 (Dense)                (None, 81, 27)       1755        ['lstm_1[0][0]']                 
                                                                                                  
==================================================================================================
Total params: 65,435
Trainable params: 65,435
Non-trainable params: 0
__________________________________________________________________________________________________
None
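
As a sanity check, the parameter counts follow from the layer shapes: each LSTM has 4 × (units × (input_dim + units) + units) = 4 × (64 × (27 + 64) + 64) = 23,552 parameters, the bottleneck Dense has 128 × 64 + 64 = 8,256, and the output Dense has 64 × 27 + 27 = 1,755.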
In [23]:
from keras.callbacks import History, ReduceLROnPlateau
h = History()
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10, min_lr=0.000001, verbose=1, min_delta=1e-5)
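Note that rlr is defined here but never passed to fit() below, and no validation data is supplied, so the val_loss it monitors would not exist. For the scheduler to actually fire, the training call would need something like this sketch (reusing the held-out test split as validation data):

    model.fit([X_train, X_train], Y_train,
              validation_data=([X_test, X_test], Y_test),
              epochs=200, batch_size=256, shuffle=True,
              callbacks=[h, rlr])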
In [24]:
from keras.optimizers import RMSprop, Adam
#opt=Adam(lr=0.005) #Default 0.001
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit([X_train,X_train],Y_train,
                    epochs=200,
                    batch_size=256,
                    shuffle=True,
                    callbacks=[h])
Epoch 1/200
1/1 [==============================] - 5s 5s/step - loss: 3.3220
Epoch 2/200
1/1 [==============================] - 0s 136ms/step - loss: 3.2955
Epoch 3/200
1/1 [==============================] - 0s 120ms/step - loss: 3.2697
Epoch 4/200
1/1 [==============================] - 0s 114ms/step - loss: 3.2442
Epoch 5/200
1/1 [==============================] - 0s 114ms/step - loss: 3.2187
Epoch 6/200
1/1 [==============================] - 0s 113ms/step - loss: 3.1930
Epoch 7/200
1/1 [==============================] - 0s 112ms/step - loss: 3.1668
Epoch 8/200
1/1 [==============================] - 0s 114ms/step - loss: 3.1397
Epoch 9/200
1/1 [==============================] - 0s 112ms/step - loss: 3.1115
Epoch 10/200
1/1 [==============================] - 0s 132ms/step - loss: 3.0817
Epoch 11/200
1/1 [==============================] - 0s 122ms/step - loss: 3.0499
Epoch 12/200
1/1 [==============================] - 0s 121ms/step - loss: 3.0155
Epoch 13/200
1/1 [==============================] - 0s 120ms/step - loss: 2.9776
Epoch 14/200
1/1 [==============================] - 0s 112ms/step - loss: 2.9354
Epoch 15/200
1/1 [==============================] - 0s 118ms/step - loss: 2.8873
Epoch 16/200
1/1 [==============================] - 0s 119ms/step - loss: 2.8319
Epoch 17/200
1/1 [==============================] - 0s 126ms/step - loss: 2.7673
Epoch 18/200
1/1 [==============================] - 0s 121ms/step - loss: 2.6916
Epoch 19/200
1/1 [==============================] - 0s 123ms/step - loss: 2.6029
Epoch 20/200
1/1 [==============================] - 0s 120ms/step - loss: 2.5000
Epoch 21/200
1/1 [==============================] - 0s 112ms/step - loss: 2.3874
Epoch 22/200
1/1 [==============================] - 0s 118ms/step - loss: 2.2937
Epoch 23/200
1/1 [==============================] - 0s 113ms/step - loss: 2.2500
Epoch 24/200
1/1 [==============================] - 0s 110ms/step - loss: 2.2352
Epoch 25/200
1/1 [==============================] - 0s 110ms/step - loss: 2.2243
Epoch 26/200
1/1 [==============================] - 0s 137ms/step - loss: 2.2074
Epoch 27/200
1/1 [==============================] - 0s 119ms/step - loss: 2.1816
Epoch 28/200
1/1 [==============================] - 0s 116ms/step - loss: 2.1479
Epoch 29/200
1/1 [==============================] - 0s 113ms/step - loss: 2.1084
Epoch 30/200
1/1 [==============================] - 0s 115ms/step - loss: 2.0706
Epoch 31/200
1/1 [==============================] - 0s 111ms/step - loss: 2.0446
Epoch 32/200
1/1 [==============================] - 0s 112ms/step - loss: 2.0314
Epoch 33/200
1/1 [==============================] - 0s 123ms/step - loss: 2.0232
Epoch 34/200
1/1 [==============================] - 0s 189ms/step - loss: 2.0120
Epoch 35/200
1/1 [==============================] - 0s 193ms/step - loss: 1.9938
Epoch 36/200
1/1 [==============================] - 0s 202ms/step - loss: 1.9674
Epoch 37/200
1/1 [==============================] - 0s 186ms/step - loss: 1.9345
Epoch 38/200
1/1 [==============================] - 0s 201ms/step - loss: 1.9001
Epoch 39/200
1/1 [==============================] - 0s 185ms/step - loss: 1.8694
Epoch 40/200
1/1 [==============================] - 0s 196ms/step - loss: 1.8434
Epoch 41/200
1/1 [==============================] - 0s 204ms/step - loss: 1.8172
Epoch 42/200
1/1 [==============================] - 0s 183ms/step - loss: 1.7916
Epoch 43/200
1/1 [==============================] - 0s 194ms/step - loss: 1.7724
Epoch 44/200
1/1 [==============================] - 0s 197ms/step - loss: 1.7556
Epoch 45/200
1/1 [==============================] - 0s 190ms/step - loss: 1.7357
Epoch 46/200
1/1 [==============================] - 0s 194ms/step - loss: 1.7150
Epoch 47/200
1/1 [==============================] - 0s 185ms/step - loss: 1.7010
Epoch 48/200
1/1 [==============================] - 0s 188ms/step - loss: 1.6811
Epoch 49/200
1/1 [==============================] - 0s 193ms/step - loss: 1.6680
Epoch 50/200
1/1 [==============================] - 0s 192ms/step - loss: 1.6530
Epoch 51/200
1/1 [==============================] - 0s 200ms/step - loss: 1.6398
Epoch 52/200
1/1 [==============================] - 0s 186ms/step - loss: 1.6267
Epoch 53/200
1/1 [==============================] - 0s 191ms/step - loss: 1.6193
Epoch 54/200
1/1 [==============================] - 0s 185ms/step - loss: 1.6057
Epoch 55/200
1/1 [==============================] - 0s 188ms/step - loss: 1.6004
Epoch 56/200
1/1 [==============================] - 0s 204ms/step - loss: 1.5890
Epoch 57/200
1/1 [==============================] - 0s 195ms/step - loss: 1.5803
Epoch 58/200
1/1 [==============================] - 0s 216ms/step - loss: 1.5746
Epoch 59/200
1/1 [==============================] - 0s 228ms/step - loss: 1.5636
Epoch 60/200
1/1 [==============================] - 0s 201ms/step - loss: 1.5581
Epoch 61/200
1/1 [==============================] - 0s 205ms/step - loss: 1.5516
Epoch 62/200
1/1 [==============================] - 0s 218ms/step - loss: 1.5440
Epoch 63/200
1/1 [==============================] - 0s 197ms/step - loss: 1.5396
Epoch 64/200
1/1 [==============================] - 0s 205ms/step - loss: 1.5332
Epoch 65/200
1/1 [==============================] - 0s 212ms/step - loss: 1.5287
Epoch 66/200
1/1 [==============================] - 0s 202ms/step - loss: 1.5240
Epoch 67/200
1/1 [==============================] - 0s 203ms/step - loss: 1.5189
Epoch 68/200
1/1 [==============================] - 0s 201ms/step - loss: 1.5151
Epoch 69/200
1/1 [==============================] - 0s 196ms/step - loss: 1.5099
Epoch 70/200
1/1 [==============================] - 0s 189ms/step - loss: 1.5061
Epoch 71/200
1/1 [==============================] - 0s 198ms/step - loss: 1.5016
Epoch 72/200
1/1 [==============================] - 0s 212ms/step - loss: 1.4975
Epoch 73/200
1/1 [==============================] - 0s 212ms/step - loss: 1.4943
Epoch 74/200
1/1 [==============================] - 0s 195ms/step - loss: 1.4902
Epoch 75/200
1/1 [==============================] - 0s 195ms/step - loss: 1.4865
Epoch 76/200
1/1 [==============================] - 0s 195ms/step - loss: 1.4834
Epoch 77/200
1/1 [==============================] - 0s 201ms/step - loss: 1.4807
Epoch 78/200
1/1 [==============================] - 0s 194ms/step - loss: 1.4798
Epoch 79/200
1/1 [==============================] - 0s 201ms/step - loss: 1.4778
Epoch 80/200
1/1 [==============================] - 0s 197ms/step - loss: 1.4809
Epoch 81/200
1/1 [==============================] - 0s 192ms/step - loss: 1.4669
Epoch 82/200
1/1 [==============================] - 0s 195ms/step - loss: 1.4830
Epoch 83/200
1/1 [==============================] - 0s 195ms/step - loss: 1.4899
Epoch 84/200
1/1 [==============================] - 0s 194ms/step - loss: 1.4959
Epoch 85/200
1/1 [==============================] - 0s 200ms/step - loss: 1.4657
Epoch 86/200
1/1 [==============================] - 0s 199ms/step - loss: 1.4808
Epoch 87/200
1/1 [==============================] - 0s 211ms/step - loss: 1.4698
Epoch 88/200
1/1 [==============================] - 0s 214ms/step - loss: 1.4607
Epoch 89/200
1/1 [==============================] - 0s 202ms/step - loss: 1.4676
Epoch 90/200
1/1 [==============================] - 0s 202ms/step - loss: 1.4644
Epoch 91/200
1/1 [==============================] - 0s 194ms/step - loss: 1.4513
Epoch 92/200
1/1 [==============================] - 0s 201ms/step - loss: 1.4515
Epoch 93/200
1/1 [==============================] - 0s 197ms/step - loss: 1.4535
Epoch 94/200
1/1 [==============================] - 0s 193ms/step - loss: 1.4420
Epoch 95/200
1/1 [==============================] - 0s 196ms/step - loss: 1.4480
Epoch 96/200
1/1 [==============================] - 0s 203ms/step - loss: 1.4430
Epoch 97/200
1/1 [==============================] - 0s 197ms/step - loss: 1.4365
Epoch 98/200
1/1 [==============================] - 0s 196ms/step - loss: 1.4408
Epoch 99/200
1/1 [==============================] - 0s 198ms/step - loss: 1.4311
Epoch 100/200
1/1 [==============================] - 0s 201ms/step - loss: 1.4327
Epoch 101/200
1/1 [==============================] - 0s 209ms/step - loss: 1.4308
Epoch 102/200
1/1 [==============================] - 0s 196ms/step - loss: 1.4251
Epoch 103/200
1/1 [==============================] - 0s 200ms/step - loss: 1.4261
Epoch 104/200
1/1 [==============================] - 0s 196ms/step - loss: 1.4224
Epoch 105/200
1/1 [==============================] - 0s 200ms/step - loss: 1.4190
Epoch 106/200
1/1 [==============================] - 0s 192ms/step - loss: 1.4191
Epoch 107/200
1/1 [==============================] - 0s 143ms/step - loss: 1.4149
Epoch 108/200
1/1 [==============================] - 0s 138ms/step - loss: 1.4129
Epoch 109/200
1/1 [==============================] - 0s 126ms/step - loss: 1.4116
Epoch 110/200
1/1 [==============================] - 0s 117ms/step - loss: 1.4077
Epoch 111/200
1/1 [==============================] - 0s 116ms/step - loss: 1.4074
Epoch 112/200
1/1 [==============================] - 0s 116ms/step - loss: 1.4031
Epoch 113/200
1/1 [==============================] - 0s 123ms/step - loss: 1.4028
Epoch 114/200
1/1 [==============================] - 0s 113ms/step - loss: 1.3986
Epoch 115/200
1/1 [==============================] - 0s 116ms/step - loss: 1.3979
Epoch 116/200
1/1 [==============================] - 0s 120ms/step - loss: 1.3944
Epoch 117/200
1/1 [==============================] - 0s 140ms/step - loss: 1.3938
Epoch 118/200
1/1 [==============================] - 0s 131ms/step - loss: 1.3902
Epoch 119/200
1/1 [==============================] - 0s 118ms/step - loss: 1.3892
Epoch 120/200
1/1 [==============================] - 0s 127ms/step - loss: 1.3860
Epoch 121/200
1/1 [==============================] - 0s 120ms/step - loss: 1.3846
Epoch 122/200
1/1 [==============================] - 0s 124ms/step - loss: 1.3824
Epoch 123/200
1/1 [==============================] - 0s 115ms/step - loss: 1.3797
Epoch 124/200
1/1 [==============================] - 0s 113ms/step - loss: 1.3788
Epoch 125/200
1/1 [==============================] - 0s 133ms/step - loss: 1.3761
Epoch 126/200
1/1 [==============================] - 0s 116ms/step - loss: 1.3734
Epoch 127/200
1/1 [==============================] - 0s 121ms/step - loss: 1.3726
Epoch 128/200
1/1 [==============================] - 0s 122ms/step - loss: 1.3708
Epoch 129/200
1/1 [==============================] - 0s 121ms/step - loss: 1.3674
Epoch 130/200
1/1 [==============================] - 0s 113ms/step - loss: 1.3655
Epoch 131/200
1/1 [==============================] - 0s 125ms/step - loss: 1.3648
Epoch 132/200
1/1 [==============================] - 0s 119ms/step - loss: 1.3630
Epoch 133/200
1/1 [==============================] - 0s 130ms/step - loss: 1.3609
Epoch 134/200
1/1 [==============================] - 0s 119ms/step - loss: 1.3578
Epoch 135/200
1/1 [==============================] - 0s 113ms/step - loss: 1.3552
Epoch 136/200
1/1 [==============================] - 0s 121ms/step - loss: 1.3528
Epoch 137/200
1/1 [==============================] - 0s 116ms/step - loss: 1.3509
Epoch 138/200
1/1 [==============================] - 0s 124ms/step - loss: 1.3492
Epoch 139/200
1/1 [==============================] - 0s 114ms/step - loss: 1.3478
Epoch 140/200
1/1 [==============================] - 0s 117ms/step - loss: 1.3478
Epoch 141/200
1/1 [==============================] - 0s 134ms/step - loss: 1.3487
Epoch 142/200
1/1 [==============================] - 0s 118ms/step - loss: 1.3542
Epoch 143/200
1/1 [==============================] - 0s 119ms/step - loss: 1.3429
Epoch 144/200
1/1 [==============================] - 0s 126ms/step - loss: 1.3372
Epoch 145/200
1/1 [==============================] - 0s 118ms/step - loss: 1.3386
Epoch 146/200
1/1 [==============================] - 0s 117ms/step - loss: 1.3349
Epoch 147/200
1/1 [==============================] - 0s 137ms/step - loss: 1.3312
Epoch 148/200
1/1 [==============================] - 0s 124ms/step - loss: 1.3313
Epoch 149/200
1/1 [==============================] - 0s 124ms/step - loss: 1.3283
Epoch 150/200
1/1 [==============================] - 0s 121ms/step - loss: 1.3251
Epoch 151/200
1/1 [==============================] - 0s 117ms/step - loss: 1.3242
Epoch 152/200
1/1 [==============================] - 0s 118ms/step - loss: 1.3220
Epoch 153/200
1/1 [==============================] - 0s 121ms/step - loss: 1.3190
Epoch 154/200
1/1 [==============================] - 0s 125ms/step - loss: 1.3172
Epoch 155/200
1/1 [==============================] - 0s 116ms/step - loss: 1.3157
Epoch 156/200
1/1 [==============================] - 0s 114ms/step - loss: 1.3139
Epoch 157/200
1/1 [==============================] - 0s 138ms/step - loss: 1.3112
Epoch 158/200
1/1 [==============================] - 0s 129ms/step - loss: 1.3087
Epoch 159/200
1/1 [==============================] - 0s 122ms/step - loss: 1.3063
Epoch 160/200
1/1 [==============================] - 0s 135ms/step - loss: 1.3042
Epoch 161/200
1/1 [==============================] - 0s 128ms/step - loss: 1.3021
Epoch 162/200
1/1 [==============================] - 0s 118ms/step - loss: 1.3002
Epoch 163/200
1/1 [==============================] - 0s 117ms/step - loss: 1.2992
Epoch 164/200
1/1 [==============================] - 0s 126ms/step - loss: 1.3038
Epoch 165/200
1/1 [==============================] - 0s 127ms/step - loss: 1.3426
Epoch 166/200
1/1 [==============================] - 0s 131ms/step - loss: 1.2966
Epoch 167/200
1/1 [==============================] - 0s 121ms/step - loss: 1.3048
Epoch 168/200
1/1 [==============================] - 0s 132ms/step - loss: 1.3118
Epoch 169/200
1/1 [==============================] - 0s 119ms/step - loss: 1.3086
Epoch 170/200
1/1 [==============================] - 0s 126ms/step - loss: 1.2925
Epoch 171/200
1/1 [==============================] - 0s 129ms/step - loss: 1.3108
Epoch 172/200
1/1 [==============================] - 0s 125ms/step - loss: 1.2932
Epoch 173/200
1/1 [==============================] - 0s 128ms/step - loss: 1.3093
Epoch 174/200
1/1 [==============================] - 0s 120ms/step - loss: 1.3010
Epoch 175/200
1/1 [==============================] - 0s 122ms/step - loss: 1.2865
Epoch 176/200
1/1 [==============================] - 0s 124ms/step - loss: 1.2922
Epoch 177/200
1/1 [==============================] - 0s 129ms/step - loss: 1.2933
Epoch 178/200
1/1 [==============================] - 0s 117ms/step - loss: 1.2805
Epoch 179/200
1/1 [==============================] - 0s 123ms/step - loss: 1.2801
Epoch 180/200
1/1 [==============================] - 0s 117ms/step - loss: 1.2833
Epoch 181/200
1/1 [==============================] - 0s 141ms/step - loss: 1.2761
Epoch 182/200
1/1 [==============================] - 0s 119ms/step - loss: 1.2701
Epoch 183/200
1/1 [==============================] - 0s 136ms/step - loss: 1.2745
Epoch 184/200
1/1 [==============================] - 0s 148ms/step - loss: 1.2691
Epoch 185/200
1/1 [==============================] - 0s 189ms/step - loss: 1.2636
Epoch 186/200
1/1 [==============================] - 0s 205ms/step - loss: 1.2666
Epoch 187/200
1/1 [==============================] - 0s 210ms/step - loss: 1.2621
Epoch 188/200
1/1 [==============================] - 0s 195ms/step - loss: 1.2572
Epoch 189/200
1/1 [==============================] - 0s 196ms/step - loss: 1.2592
Epoch 190/200
1/1 [==============================] - 0s 189ms/step - loss: 1.2538
Epoch 191/200
1/1 [==============================] - 0s 192ms/step - loss: 1.2525
Epoch 192/200
1/1 [==============================] - 0s 212ms/step - loss: 1.2520
Epoch 193/200
1/1 [==============================] - 0s 193ms/step - loss: 1.2468
Epoch 194/200
1/1 [==============================] - 0s 189ms/step - loss: 1.2472
Epoch 195/200
1/1 [==============================] - 0s 199ms/step - loss: 1.2443
Epoch 196/200
1/1 [==============================] - 0s 193ms/step - loss: 1.2412
Epoch 197/200
1/1 [==============================] - 0s 208ms/step - loss: 1.2408
Epoch 198/200
1/1 [==============================] - 0s 194ms/step - loss: 1.2366
Epoch 199/200
1/1 [==============================] - 0s 190ms/step - loss: 1.2362
Epoch 200/200
1/1 [==============================] - 0s 190ms/step - loss: 1.2330
Out[24]:
<keras.callbacks.History at 0x7db0ab4610f0>
In [25]:
plt.plot(h.history["loss"], label="Loss")
#plt.plot(h.history["val_loss"], label="Val_Loss")
plt.yscale("log")
plt.legend()
Out[25]:
<matplotlib.legend.Legend at 0x7db0ab460e50>
[Image: training loss per epoch on a log scale]
In [26]:
# Teacher-forced reconstruction of the test set: predict each molecule one
# character ahead and print the cases where prediction and truth differ
for i in range(len(X_test)):
    v = model.predict([X_test[i:i+1], X_test[i:i+1]])
    idxs = np.argmax(v, axis=2)
    pred = "".join([int_to_char[h] for h in idxs[0]])[:-1]
    idxs2 = np.argmax(X_test[i:i+1], axis=2)
    true = "".join([int_to_char[k] for k in idxs2[0]])[1:]
    if true != pred:
        print(true, pred)
1/1 [==============================] - 1s 1s/step
CN1CC[C@@H](Nc2ncc(c3C=C(C)C(=O)Nc23)c4cncc(C)c4)[C@@H](C1)OCC5CCS(=O)(=O)CC5EEE CCCCCCCCCCCCCCccccccccc))))C)C))CCccccccccccccccccCCEECECCCCCEEEEEEEEEE=))C))))E
1/1 [==============================] - 0s 57ms/step
CCN1C=C(c2cccc(c2)C(F)(F)F)c3sc(cc3C1=O)C(=N)NC4CCS(=O)(=O)CC4EEEEEEEEEEEEEEEEEE CCCCCCCCCcccccccccc)))CC)c)cccccccccc)))CCCCCCCCCCC(CO)(=O)cEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 45ms/step
CN1C=C(c2cccc(c2)C#N)c3sc(cc3C1=O)C(=N)NC4CCS(=O)(=O)CC4EEEEEEEEEEEEEEEEEEEEEEEE CCCCCCCCcccccccccc)c)Cccccccccc)))CCCCCCCCCCC(CC)C=O)c)EEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 55ms/step
COc1cc(ccc1S(=O)(=O)Nc2ccc3N(C)C(=O)C(=Cc3c2)C)C#NEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCccccccccccC))CC))c)cccccccCCCCCC)CCCC)cccccCCCCCEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 53ms/step
CC(C)CS(=O)(=O)N[C@H]1CCC(=O)N([C@@H]1c2ccc(Cl)cc2)c3ccc4C(=CC(=O)N(C)c4c3)CEEEE CCCCCcC(CCCccc)ccCCCCCCCCCCCCCCCCCCCCCCccccccc)CcccccccccccCC)CCC)CCC)CcccccEEEE
1/1 [==============================] - 0s 47ms/step
CC(=O)c1cc(c2cc(ccc2C)C(=O)NC3CC3)c4ncccn14EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCccccccccccccccccc)c)CC)CCCCCCCCcccccccccEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 57ms/step
CC(=O)c1cc(c2cccc(C)n2)c3cc(ccn13)N4CCOCC4EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCcccccccccccccccc)cccccccccccccccCCCCCCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 47ms/step
CC(C)CS(=O)(=O)N[C@H]1CCC(=O)N([C@@H]1c2ccc(Cl)cc2)c3ccc4C(=CC(=O)N(C)c4c3)CEEEE CCCCCcC(CCCccc)ccCCCCCCCCCCCCCCCCCCCCCCccccccc)CcccccccccccCC)CCC)CCC)CcccccEEEE
1/1 [==============================] - 0s 51ms/step
CC(N1CCC(CC1)N(C)C(=O)C2=CN(CC=C)C(=O)c3[nH]ccc23)c4ccccc4EEEEEEEEEEEEEEEEEEEEEE CCCCCCCCCCCCCCCCCCCCC)C)CC)CCCCC)CCC))CcccccccEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 52ms/step
CC(N1CCC(CC1)N(C)C(=O)C2=CN(C)C(=O)c3[nH]ccc23)c4ccccc4EEEEEEEEEEEEEEEEEEEEEEEEE CCCC(CCCCCCCCCCCCCCCC)C)CC)CCCCCCC)CcccccccccEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 50ms/step
CN(C)c1ncc2C(=O)N(C)C=Cc2n1EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCcccccccccCC)CCCCCCC)cccCCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 53ms/step
CCCOc1ccn2c(cc(c3cccc(OC)c3)c2c1)C(=O)CEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCcccccccccccccccccccc)ccccccccCCCCCCCCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 48ms/step
COc1cc2N(C)C(=O)C=C(C)c2cc1NS(=O)(=O)c3ccc(cc3)C#NEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCccccccCCCCCCCCCC)C)CcccccCCCC)CC))cccccccccccEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 50ms/step
COc1ccccc1C(=O)Nc2cc3C=C(C)C(=O)N(C)c3cc2N4CCCC4EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCcccccccccCC)CCcccccc))))CCC))C)C)CccccccCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 50ms/step
COc1cc2N(C)C(=O)C(=Cc2cc1NS(=O)(=O)c3ccc(cc3OC)C#N)CEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCccccccCCCCCCCCCCC)ccccc)(CC)CC))ccccccccccc)CCCCCCEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 56ms/step
CC(=O)c1cc(c2ccccc2S(=O)(=O)C)c3ccccn13EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCcccccccccccccccccc)))))))c)ccccccccccEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 54ms/step
COc1ccc(cc1OC)C2=CN(C)C(=O)c3cc(sc23)C(=O)N4CCN(CC4)S(=O)(=O)CEEEEEEEEEEEEEEEEEE CCCccccccccccCCCCCCCCCCCCC)CccccccccccCCCCCCCCCCCCCCC(CO)(=O)cEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 47ms/step
CCCOc1ccc2c(c1)c(cn2C(=O)C)C3=C(C)N(NC3=O)C(=O)c4ccc(OC)cc4EEEEEEEEEEEEEEEEEEEEE CCCCcccccccccccCccccc))))CCCCCCCCCCCCCCCC)CCCC)CEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 60ms/step
C[C@@H]1CC(=O)Nc2cccc(N3CCCc4cc(c5cnn(C)c5)c(cc34)C(F)F)c2N1EEEEEEEEEEEEEEEEEEEE CCCCCCCCCCCCCCCCccccccccc))))CcccccccccccccccccccccCCCCCCccEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 57ms/step
CN1CCCC(C1)NC2=C(Cl)C(=O)N(C)N=C2EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCCCCCCCCCCCCCCCC)C(C)))(C)CCCCCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 49ms/step
Cc1nnc2c3ccccc3c(nn12)c4ccc(N5CCOCC5)c(NS(=O)(=O)c6ccc(Cl)cc6)c4EEEEEEEEEEEEEEEE CCcccccccccccccccccccccccccccCCCCCCCCCCCC((O))=O)ccccccccEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 50ms/step
COc1ccc(cc1OC)C2=CN(C)C(=O)c3cc(sc23)C(=O)NC4CCS(=O)(=O)CC4EEEEEEEEEEEEEEEEEEEEE CCCccccccccccCCCCCCCCCCCCC)CccccccccccCCCCCCCCCC(CC)CCO)C)EEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 49ms/step
CN1C[C@H](C[C@H](C1)c2ccccc2)NC3=C(Cl)C(=O)N(C)N=C3EEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCCCCCCCCCCCCCCCCCCccccccccccCCCCCCCCCCC)CCCCCCCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 58ms/step
CNC(=N)c1cc2C(=O)N(C)C=C(c3ccc(OC)c(OC)c3)c2s1EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCCC(CccccccCC)CCC)CCC)Cccccccc)ccc))CccccccCEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
1/1 [==============================] - 0s 57ms/step
CC(C)CS(=O)(=O)N[C@H]1CCC(=O)N([C@@H]1c2ccc(Cl)cc2)c3ccc4C(=CC(=O)N(C)c4c3)CEEEE CCCCCcC(CCCccc)ccCCCCCCCCCCCCCCCCCCCCCCccccccc)CcccccccccccCC)CCC)CCC)CcccccEEEE
1/1 [==============================] - 0s 66ms/step
CN1C(=O)C=Cc2cc(NS(=O)(=O)c3ccc(cc3)C#N)ccc12EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE CCCCCCCCcCCCccccc(()))))))ccccccccccccc)CcccccEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
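
The model above only reconstructs a known input one character ahead; it does not by itself generate new molecules. A common way to sample de novo SMILES from these trained weights (a sketch following the standard seq2seq sampling recipe; none of this was run in this notebook) is to split the network into an encoder, a latent-to-state mapper, and a stateful one-character decoder:

    import numpy as np
    from keras.models import Model
    from keras.layers import Input, LSTM, Dense

    # Encoder: one-hot SMILES -> latent vector (reuses the trained layers)
    smiles_to_latent = Model(encoder_inputs, neck_outputs)

    # Latent vector -> initial (h, c) states for the decoder LSTM
    latent_input = Input(shape=(latent_dim,))
    latent_to_states = Model(latent_input,
                             [decode_h(latent_input), decode_c(latent_input)])

    # Stateful decoder that consumes one character at a time
    dec_input = Input(batch_shape=(1, 1, output_dim))
    dec_lstm = LSTM(lstm_dim, return_sequences=True, stateful=True)
    dec_dense = Dense(output_dim, activation="softmax")
    sampler = Model(dec_input, dec_dense(dec_lstm(dec_input)))
    dec_lstm.set_weights(decoder_lstm.get_weights())
    dec_dense.set_weights(decoder_dense.get_weights())

    def latent_to_smiles(latent):
        """Greedy-decode one SMILES string from a latent vector."""
        h, c = latent_to_states.predict(latent)
        dec_lstm.reset_states(states=[h, c])
        x = np.zeros((1, 1, output_dim))
        x[0, 0, char_to_int["!"]] = 1          # start token
        out = ""
        for _ in range(embed - 1):
            probs = sampler.predict(x)[0, 0]
            idx = int(np.argmax(probs))
            if int_to_char[idx] == "E":        # end-of-sequence padding
                break
            out += int_to_char[idx]
            x = np.zeros((1, 1, output_dim))
            x[0, 0, idx] = 1
        return out

    # e.g. perturb the latent code of a training molecule to get a new candidate
    latent = smiles_to_latent.predict(X_train[0:1])
    print(latent_to_smiles(latent + 0.1 * np.random.randn(1, latent_dim)))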
In [27]:
data_training ='/drive/My Drive/smiles'
smifile = data_training + '/Results.txt'
data = pd.read_csv(smifile, delimiter = "\t", index_col=False)
data.head()
Out[27]:
Real_Smlies predicted_Smlies
0 CN1CC[C@@H](Nc2ncc(c3C=C(C)C(=O)Nc23)c4cncc(C)...
1 CCN1C=C(c2cccc(c2)C(F)(F)F)c3sc(cc3C1=O)C(=N)N...
2 CN1C=C(c2cccc(c2)C#N)c3sc(cc3C1=O)C(=N)NC4CCS(...
3 COc1cc(ccc1S(=O)(=O)Nc2ccc3N(C)C(=O)C(=Cc3c2)C...
4 CC(C)CS(=O)(=O)N[C@H]1CCC(=O)N([C@@H]1c2ccc(Cl...
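
To gauge how many of the predicted strings are chemically valid, each one can be run through RDKit's parser; a quick sketch using the column name from Results.txt:

    # Count predictions that RDKit accepts as valid SMILES
    valid = data["predicted_Smlies"].dropna().apply(
        lambda s: Chem.MolFromSmiles(str(s)) is not None)
    print(valid.sum(), "of", len(valid), "predicted SMILES are valid")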