Bulk Document Insert, Labeling, and Launching a Training Job

Let's explore how we can leverage our REST API to save documents, compute document embeddings using a combination of pretrained open-source language models, and generate custom embeddings for your documents.

To begin, we will load and process a dataset of Reuters articles. This dataset will serve as the foundation for our demonstrations.

Next, we will use the API's bulk-insert-document endpoint, which saves these documents to a database. During this process, the embeddings service will generate an embedding for each document using a pretrained embedding model that we have deployed.

One of our goals is to train a model that maps our initial document embeddings to custom embeddings tailored to our data. To achieve this, we will provide labels for pairs of documents, indicating whether their embeddings should be close together or far apart. To provide and save these labels, we will use the API's label-document-pair endpoint.

After the labeled dataset is created, we will launch a job to train a model that maps the initial embeddings to custom embeddings. This involves sending a request to the train-embedding-matrix endpoint with configuration details for how we want the training job to run. This kicks off an asynchronous job that is executed on a compute instance separate from the Operations API. The model that performs best on our test set will be saved to object storage and used to produce custom embeddings whenever new documents are inserted.

Lastly, we will send a request to the update-custom-embeddings endpoint to update all the custom embeddings in the Reuters project, ensuring that every embedding is calculated using the latest custom embedding model.

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import requests
import os

df = pd.read_csv('reuters/reuters.csv', encoding="ISO-8859-1")

We're only interested in a subset of these articles

Filter to only include articles about housing, income, jobs, retail, and CPI

housing = df['topic.housing']==1
income = df['topic.income']==1
jobs = df['topic.jobs']==1
retail = df['topic.retail']==1
cpi = df['topic.cpi']==1

df = df[housing | income | jobs | retail | cpi].loc[
    :, 
    ['doc.title', 'doc.text', 'topic.housing', 'topic.income', 'topic.jobs', 'topic.retail', 'topic.cpi']
]
df = df[df['doc.text'].notna()]

Exclude articles that are about more than one topic

This leaves us with 193 articles

def drop_multitopic_articles(df, cols):
    """
    Drop rows that are labeled with more than one of the given topic columns.
    """
    indexes_to_remove = []
    for i in range(len(cols)-1):
        for j in range(i+1, len(cols)):
            condition1 = df[cols[i]] == 1
            condition2 = df[cols[j]] == 1
            indexes_to_remove.extend(df[condition1 & condition2].index.values)

    return df.drop(indexes_to_remove)

cols = ['topic.housing', 'topic.income', 'topic.jobs', 'topic.retail', 'topic.cpi']
df = drop_multitopic_articles(df, cols)
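# idxmax returns the matching topic column (e.g. 'topic.housing'); str[6:] strips the 'topic.' prefix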
df['label'] = df[cols].idxmax(axis=1).str[6:]

View the number of articles in each category

df['label'].value_counts()
cpi        91
jobs       58
retail     19
housing    17
income      8
Name: label, dtype: int64

Note on our training setup

Notice that our data is in the form [article, label (category)], but to train a model to compute custom embeddings, we need training data in the form [article_1, article_2, label], where a label of 1 indicates the two articles are similar and a label of -1 indicates they are different. To accomplish this, we will use a technique called weak supervision, heuristically generating training data of this form. Our rule is that if two articles belong to the same category of news, we label them as similar (label = 1); otherwise, we label them as different (label = -1). We also need to be careful when generating data in this form. Currently we have no positive or negative labels, so all our data will be synthetically generated. We don't want any articles to show up in both the training and test sets, because that would artificially inflate the evaluation metrics on our test set relative to unseen data.

We need data of this form because of the training approach we're taking. We train a model that takes two embeddings as input and multiplies each by a weight matrix to produce two custom embeddings. The model then computes the cosine similarity between the two custom embeddings and uses that as its output. We want this cosine similarity to be 1 if the two inputs were embeddings of similar articles, and -1 otherwise. We use mean squared error as our loss and stochastic gradient descent to iteratively update our weights to reduce the loss. When training is complete, we extract the weight matrix and use it to compute custom embeddings for new documents. We don't use the entire model because (1) we want the input to be one embedding (not two), and (2) we want the output to be the custom embedding (not the cosine similarity between two embeddings).
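To make this concrete, here is a minimal sketch of such a model in PyTorch. It is illustrative only: the class name, optimizer settings, and training details are assumptions made for this example, not the service's actual implementation.

import torch
import torch.nn as nn

class EmbeddingMatrixModel(nn.Module):
    """Maps two initial embeddings through a shared weight matrix and outputs their cosine similarity."""
    def __init__(self, embedding_length, custom_embedding_length, dropout_rate=0.0):
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)
        # the weight matrix that is later extracted and used on its own to compute custom embeddings
        self.weights = nn.Linear(embedding_length, custom_embedding_length, bias=False)

    def forward(self, embedding_1, embedding_2):
        custom_1 = self.weights(self.dropout(embedding_1))
        custom_2 = self.weights(self.dropout(embedding_2))
        return nn.functional.cosine_similarity(custom_1, custom_2, dim=-1)

model = EmbeddingMatrixModel(embedding_length=768, custom_embedding_length=2048)
loss_fn = nn.MSELoss()                                     # targets are the pair labels: 1 or -1
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

After training, the learned weight matrix (model.weights.weight) is what gets saved and applied to single embeddings; the rest of the model is discarded.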

Split the articles into train and test sets.

Split before generating synthetic positive and negative examples so that articles can't show up in both the train and test sets.

from sklearn.model_selection import train_test_split

test_fraction = 0.5
random_seed = 100
train_df, test_df = train_test_split(
    df, test_size=test_fraction, stratify=df['label'], random_state=random_seed
)
# work on copies so the 'dataset' column assignments below don't trigger SettingWithCopyWarning
train_df = train_df.copy()
test_df = test_df.copy()
train_df['dataset'] = 'train'
test_df['dataset'] = 'test'

View the number of articles in each category in both the train and test datasets

train_df['label'].value_counts()
cpi        45
jobs       29
retail      9
housing     9
income      4
Name: label, dtype: int64
test_df['label'].value_counts()
cpi        46
jobs       29
retail     10
housing     8
income      4
Name: label, dtype: int64

Generate synthetic positive and negative document pair labels for both the training and test splits

  • Documents in the same category get a pair label of 1
  • Documents in different categories get a pair label of -1

def gen_positives_dataframe(
    dataframe_of_labeled_documents: pd.DataFrame, 
    dataset: str
) -> pd.DataFrame:
    """
    This function takes a dataframe (dataframe_of_label_documents) where each row corresponds to a Reuters article. 
    Each row contains the text of the article in the "doc.text" column, the category of the article in the "label" column,
    and the dataset of the article in the "dataset" column.
    
    The goal of this function to provide the set of all positive pairs of articles,
    where a positive pair is a pair of articles that belong to the same category.
    
    Each element of the positive set of pairs corresponds to a row in a dataframe in the output of this function.
    
    The "dataset" column is assigned using the provided dataset name.
    """
    
    positive_pairs = set()
    for l in dataframe_of_labeled_documents['label'].unique():
        temp_df = dataframe_of_labeled_documents[dataframe_of_labeled_documents['label']==l].copy()
        texts = set(temp_df['doc.text'].unique())
        positive_pairs_for_label = {(t1, t2) for t1 in texts for t2 in texts if t1 < t2}
        positive_pairs.update(positive_pairs_for_label)
        
    df_of_positives = pd.DataFrame(list(positive_pairs), columns=['document_1', 'document_2'])
    df_of_positives['label'] = 1
    df_of_positives['dataset'] = dataset
    
    return df_of_positives

def gen_negatives_dataframe(
    dataframe_of_labeled_documents: pd.DataFrame, 
    dataframe_of_positives: pd.DataFrame, 
    dataset: str
) -> pd.DataFrame:
    """
    This function takes a dataframe (dataframe_of_label_documents) where each row corresponds to a Reuters article.
    Each row contains the text of the article in the "doc.text" column, the category of the article in the "label" column,
    and the dataset of the article in the "dataset" column.
    
    The dataframe_of_positives is a DataFrame generated by the gen_positives_dataframe function,
    which contains the set of all positive pairs given the DataFrame of labeled documents.
    
    The goal of this function is to provide the set of all negative pairs of articles,
    where a negative pair is a pair of articles that belong to separate categories.
    
    We compute the set of negative pairs by taking the set difference of all possible pairs and all positive pairs.
    We assign a label of -1 to each pair in this set and assign the "dataset" column using the provided dataset name.
    """
    texts = set(dataframe_of_labeled_documents['doc.text'].values)
    all_pairs = {(t1, t2) for t1 in texts for t2 in texts if t1<t2}
    positive_pairs = set(
        tuple(text_pair)
        for text_pair in dataframe_of_positives[['document_1', 'document_2']].values
    )
    negative_pairs = all_pairs - positive_pairs
    df_of_negatives = pd.DataFrame(list(negative_pairs), columns=['document_1', 'document_2'])
    df_of_negatives['label'] = -1
    df_of_negatives['dataset'] = dataset
    
    return df_of_negatives
    
train_df_positives = gen_positives_dataframe(train_df, 'train')
test_df_positives = gen_positives_dataframe(test_df, 'test')
train_df_negatives = gen_negatives_dataframe(train_df, train_df_positives, 'train')
test_df_negatives = gen_negatives_dataframe(test_df, test_df_positives, 'test')

# include the same number of positive and negative samples in the training set
train_df = pd.concat([
    train_df_positives,
    train_df_negatives.sample(
        n=len(train_df_positives),
        random_state=random_seed
    )
])

# include the same number of positive and negative samples in the test set
test_df = pd.concat([
    test_df_positives,
    test_df_negatives.sample(
        n=len(test_df_positives),
        random_state=random_seed
    )
])

# only use a subset of documents. 500 in train, 500 in test
train_df = train_df.sample(
    n=500,
    random_state=random_seed
)
test_df = test_df.sample(
    n=500,
    random_state=random_seed
)

# concatenate the train and test dataframes
df = pd.concat([train_df, test_df])
df.head()
document_1 document_2 label dataset
836 West German retail turnover rose a real one pc... colombia's cost of living index rose 2.03 pct ... -1 train
1294 Australia's economy should manage modest growt... French inflation slowed in February to between... 1 train
1268 Commerce Secretary Malcolm Baldrige predicted ... U.S. Completions of new homes fell 0.2 pct in ... 1 train
2853 Japan's consumer price index (base 1985) was u... U.S. economic data this week could be the key ... -1 train
294 Bolivia is to make a formal offer during the n... Inflation in Turkey was 3.7 pct in March compa... 1 train

Insert documents using the bulk insert endpoint

Use the bulk-insert-document endpoint to save documents in batches of 5. We could use larger batch sizes depending on the amount of memory that the Operations API has. Each document is saved to a database and has an embedding computed. Specify that these documents belong to the 'reuters' project.

OPERATIONS_API_BASE_URL = '[OPERATIONS_API_BASE_URL]'
BULK_INSERT_URI = OPERATIONS_API_BASE_URL + 'embeddings/bulk-insert-document'
PAIR_LABEL_URI = OPERATIONS_API_BASE_URL + 'embeddings/label-document-pair-operation'

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

s1_vals = df['document_1'].unique()
s2_vals = df['document_2'].unique()
all_sentences = np.concatenate((s1_vals, s2_vals))
all_sentences_unique = np.unique(all_sentences)

# Pod runs out of memory if not done in small batches
for sentences in chunker(all_sentences_unique, 5):
    # convert sentences from ndarray to list so they can be passed in request body
    sentences = list(sentences)
    op_input = {
        'documents': sentences,
        'model_name': 'default_embedding_model',
        'model_version': 'latest',
        'project': 'reuters'
    }
    response = requests.post(BULK_INSERT_URI, json=op_input)

Label the document pairs

Iterate over our DataFrame of document pair labels. Each row contains two documents, the label, and which split (train or test) the pair belongs to. For each row, send a request that saves the labeled pair to a database so it can be used in the training job we launch later. Specify that these labels belong to the 'reuters' project.

for i, row in df.iterrows():
    op_input = {
        'document_1': row['document_1'],
        'document_2': row['document_2'],
        'project': 'reuters',
        # cast to a plain Python int so the numpy integer serializes cleanly to JSON
        'label': int(row['label']),
        'split': row['dataset']
    }
    response = requests.post(PAIR_LABEL_URI, json=op_input)

Launch Training Job

The train-embedding-matrix endpoint is a bit different from the other endpoints. Instead of waiting for the training job to finish, it kicks off an asynchronous job that we can monitor given its invocation_id. This training job is not executed by the Operations API; it runs in a temporary pod that is spun up specifically for this job and goes away once the job is completed.
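As an illustration, monitoring could look like the sketch below. The get-invocation-status endpoint, its path, and the response fields used here are assumptions made for this example, not documented parts of the API.

import time
import requests

# hypothetical status endpoint -- the path and response shape are assumptions, not part of the documented API
STATUS_URI = OPERATIONS_API_BASE_URL + 'embeddings/get-invocation-status'

def wait_for_invocation(invocation_id, poll_seconds=30):
    """Poll the (assumed) status endpoint until the asynchronous training job reports a terminal state."""
    while True:
        response = requests.post(STATUS_URI, json={'invocation_id': invocation_id})
        status = response.json().get('status')
        if status in ('completed', 'failed'):
            return status
        time.sleep(poll_seconds)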

The training job operation takes the following inputs:

  • embedding_length: the length of the initial_embeddings of documents in the given project. This will depend on which model you used to compute your initial embeddings.
  • custom_embedding_length: the length of the custom embeddings to compute. This is used to set the number of columns in the embedding weight matrix.
  • dropout_rate: This will set the dropout rate in the dropout layer that is applied during training to both inputs before multiplying the inputs by the weight matrix.
  • project: name of the project to train a custom embedding model for. This is used as a filter to retrieve only the document pair labels in the given project. It is also used to associate the trained model with that project, so the model is only applied when inserting new documents into it.
  • hyperparameter_search_space: tuples of (batch_size, learning_rate) to try out during training. Each tuple will produce max_epochs candidate models to use. The model that performs the best on the test set will be saved. Note - you can customize this operation to allow for combinations of more hyperparameters.
  • max_epochs: number of epochs to use during training for each combination of hyperparameters.

When the training job is done, the model with the best performance on the test set is chosen. The embedding matrix of that model is saved as a .npy file. An Artifact record is created for that file, and the file is uploaded to object storage using the default storage backend. The Artifact is associated with the project given as input to the training job. When new documents are inserted using either the insert-document or bulk-insert-document operations, the Artifact module retrieves that file and uses it to compute a custom embedding by multiplying the initial embedding of the document by the embedding matrix. The custom embedding is saved alongside each document inserted this way.
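Conceptually, applying the saved matrix to a new document reduces to a single matrix multiplication, as in the sketch below. The file name is made up for illustration, and the matrix is assumed to be stored with shape (embedding_length, custom_embedding_length).

import numpy as np

# illustrative only: the artifact file name and its retrieval are handled internally by the service
embedding_matrix = np.load('reuters_embedding_matrix.npy')  # assumed shape: (768, 2048)

initial_embedding = np.random.rand(768)                     # stand-in for a pretrained model's output
custom_embedding = initial_embedding @ embedding_matrix     # resulting shape: (2048,)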

import requests

TRAIN_URL = OPERATIONS_API_BASE_URL + 'embeddings/train-embedding-matrix'

op_input = {
    'embedding_length': 768,
    'custom_embedding_length': 2048,
    'dropout_rate': 0.0,
    'project': 'reuters',
    'hyperparameter_search_space': {
        'data': [
            {'batch_size': 32, 'learning_rate': 0.001},
            {'batch_size': 64, 'learning_rate': 0.0001}
        ]
    },
    'max_epochs': 32
}

response = requests.post(TRAIN_URL, json=op_input)
response.json()
{'invocation_id': '0d29bb86413645d8867e35840d96a75d'}

Now a custom embedding is computed when a document is inserted into the Reuters project

INSERT_DOC_URL = OPERATIONS_API_BASE_URL + 'embeddings/insert-document-operation'
op_input = {
    'document': """WASHINGTON, June 29 (Reuters) - The U.S. State Department has approved the sale of F-35 fighter jets, munitions and related equipment to the Czech Republic in a deal valued at up to $5.62 billion, the Pentagon said on Thursday.
The Pentagon's Defense Security Cooperation Agency notified Congress of the possible sale on Thursday. The principal contractors will be Lockheed Martin (LMT.N), Raytheon (RTX.N) and Boeing (BA.N).
Last year, the Czech government said it wanted to buy 24 F-35 jets to replace leased Gripen fighters from Sweden's Saab AB (SAABb.ST).
The package approved by the State Department would also include a spare engine, 70 AIM-120C-8 Advanced Medium Range Air-to-Air Missiles (AMRAAMs), various bombs, support equipment, spares and technical support, the Pentagon said.
Despite approval by the State Department, the notification does not indicate that a contract has been signed or that negotiations have concluded.""",
    
    'project': 'reuters',
    'model_name': 'default_embedding_model',
    'model_version': 'latest'
}
response = requests.post(INSERT_DOC_URL, json=op_input)
out = response.json()

print('document_id:', out['data']['document_id'])
print('document', out['data']['document'])
print('initial_embedding', out['data']['initial_embedding'][:5])
print('custom_embedding', out['data']['custom_embedding'][:5])
document_id: 462e4cb881eb44e6a76719a289fb5f66
document WASHINGTON, June 29 (Reuters) - The U.S. State Department has approved the sale of F-35 fighter jets, munitions and related equipment to the Czech Republic in a deal valued at up to $5.62 billion, the Pentagon said on Thursday.
The Pentagon's Defense Security Cooperation Agency notified Congress of the possible sale on Thursday. The principal contractors will be Lockheed Martin (LMT.N), Raytheon (RTX.N) and Boeing (BA.N).
Last year, the Czech government said it wanted to buy 24 F-35 jets to replace leased Gripen fighters from Sweden's Saab AB (SAABb.ST).
The package approved by the State Department would also include a spare engine, 70 AIM-120C-8 Advanced Medium Range Air-to-Air Missiles (AMRAAMs), various bombs, support equipment, spares and technical support, the Pentagon said.
Despite approval by the State Department, the notification does not indicate that a contract has been signed or that negotiations have concluded.
initial_embedding [-0.009968794882297516, 0.14886946976184845, 0.12062198668718338, 0.05534030497074127, -0.0002789426071103662]
custom_embedding [1.3544219437736782, 0.10748239744947335, -0.7904695454790299, -2.7804214761533332, -5.775485003665196]

Update the custom embeddings of all documents already in the Reuters project

UPDATE_EMBEDDINGS_URL = OPERATIONS_API_BASE_URL + 'embeddings/update-custom-embeddings'
op_input = {'project': 'reuters'}
response = requests.post(UPDATE_EMBEDDINGS_URL, json=op_input)
response.json()
{'invocation_id': 'e04b1827ddea436c8d2b5d74705b7670'}
