Machine Learning Notes

Embeddings

TensorFlow: framework for creating and deploying ML models
Three main tasks
- Create a TensorFlow Dataset object from a token generator
- Create a neural network to train embeddings
- Train the network
tf.data.Dataset.from_generator()
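- First argument is the token generator function
- output_signature keyword argument describes each element the generator yields
  - A tf.TensorSpec() object, or a nested structure (such as a tuple) of them, matching what the generator yields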
Dataset must be batched before it is fed to the neural network
- ds = ds.batch(16)
- ds.take(1) returns a new dataset containing one batch
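
A minimal sketch of the dataset steps above. The token generator here is hypothetical: it yields made-up (target, context) integer ID pairs, which is an assumption for illustration and not a layout taken from these notes.

```python
import tensorflow as tf

# Hypothetical token generator: yields (target_id, context_id) pairs.
# The IDs are made-up sample data for illustration only.
def token_generator():
    for i in range(40):
        yield i, i + 1

# output_signature mirrors the structure the generator yields:
# a tuple of two scalar int32 tensors.
signature = (
    tf.TensorSpec(shape=(), dtype=tf.int32),
    tf.TensorSpec(shape=(), dtype=tf.int32),
)

ds = tf.data.Dataset.from_generator(token_generator, output_signature=signature)

# Group elements into batches of 16 before feeding a network.
ds = ds.batch(16)

# take(1) gives a dataset holding only the first batch.
for targets, contexts in ds.take(1):
    print(targets.shape, contexts.shape)
```

With 40 elements and a batch size of 16, the last batch is smaller than the others; ds.batch(16, drop_remainder=True) would drop it if fixed shapes are needed.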

Options for training a new model
- Fine tuning
  - Potentially very costly
  - Time consuming to train (or fine tune) a model
  - Low Rank Adaptation (LoRA) and Quantized Low Rank Adaptation (QLoRA) are options
    - Able to run on lower-resource hardware
  - Risk of catastrophic forgetting
- Freeze the base model, add adapter layers, and train only the adapter layers
  - Still susceptible to hallucinations
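
A rough worked example of why LoRA fits on smaller hardware: instead of updating a full d x k weight matrix, it trains two low-rank factors of shapes d x r and r x k. The sizes below (4096 and rank 8) are illustrative assumptions, not values from these notes.

```python
# Illustrative sizes (assumptions): one 4096 x 4096 weight matrix, LoRA rank 8.
d, k, r = 4096, 4096, 8

# Full fine-tuning updates every entry of the d x k matrix.
full_params = d * k

# LoRA freezes the matrix and trains B (d x r) and A (r x k) instead.
lora_params = r * (d + k)

print(full_params)                # 16777216
print(lora_params)                # 65536
print(lora_params / full_params)  # 0.00390625
```

Under these assumed sizes, the trainable parameters for that one matrix drop to about 0.4% of the full count.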
Sentence-based Bidirectional Encoder Representations from Transformers (SBERT)
- General idea is the same as training word embeddings, but each sentence (rather than each word) maps to a vector
- sentence-transformers Python library provides several pretrained SBERT models (https://www.sbert.net/docs/sentence_transformer/pretrained_models.html)
LangChain
- Text Splitters
  - https://python.langchain.com/docs/concepts/text_splitters/
  - RecursiveCharacterTextSplitter()
    - chunk_size - maximum length of each output chunk, as measured by length_function
    - chunk_overlap - number of characters shared between consecutive chunks
    - length_function - function used to measure the length of a chunk (for example, len)

Handling HTML data
- Beautiful Soup (parse HTML, extract text): https://beautiful-soup-4.readthedocs.io/en/latest/
- markdownify (convert HTML to Markdown): https://pypi.org/project/markdownify/
Extracting metadata from alerts
- Identify common alert types / categories based on Teams message title
- Identify metadata fields from alert body
- Create regular expressions to collect metadata from alert body
- Develop a function to automatically extract metadata
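
The steps above could be sketched as follows. The categories, field names, and alert layout ("Host: ...", "Severity: ...", "Value: ...%") are hypothetical; real patterns would come from inspecting actual alert titles and bodies.

```python
import re

# Hypothetical alert categories, keyed by a pattern on the message title.
TITLE_PATTERNS = {
    "disk": re.compile(r"disk", re.IGNORECASE),
    "cpu": re.compile(r"cpu", re.IGNORECASE),
}

# Hypothetical metadata fields and the pattern that captures each one.
FIELD_PATTERNS = {
    "host": re.compile(r"Host:\s*(\S+)"),
    "severity": re.compile(r"Severity:\s*(\w+)"),
    "value": re.compile(r"Value:\s*([\d.]+)%"),
}

def extract_metadata(title, body):
    """Return the alert category and any fields found in the body."""
    category = next(
        (name for name, pat in TITLE_PATTERNS.items() if pat.search(title)),
        "unknown",
    )
    metadata = {"category": category}
    for field, pat in FIELD_PATTERNS.items():
        match = pat.search(body)
        if match:
            metadata[field] = match.group(1)
    return metadata

# Made-up alert for illustration.
title = "Disk usage alert"
body = "Host: db01\nSeverity: critical\nValue: 91.5%"
print(extract_metadata(title, body))
# {'category': 'disk', 'host': 'db01', 'severity': 'critical', 'value': '91.5'}
```

Fields that do not match are simply left out of the result, so one function can serve alert types with different bodies.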

Data can be formatted in several ways
- Flattened JSON object
- Markdown object
- Large blob of text
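
A sketch of the three formats applied to one record. The alert record and the dotted-key flattening scheme are assumptions made for illustration.

```python
import json

# Made-up alert record with one level of nesting.
alert = {"category": "disk", "host": {"name": "db01", "region": "eu"}, "value": 91.5}

def flatten(obj, prefix=""):
    """Flatten nested dicts into one level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

flat = flatten(alert)

# Flattened JSON object
as_json = json.dumps(flat)

# Markdown: one bullet per field
as_markdown = "\n".join(f"- **{k}**: {v}" for k, v in flat.items())

# Large blob of text: fields run together as sentences
as_text = " ".join(f"{k} is {v}." for k, v in flat.items())

print(as_json)
print(as_markdown)
print(as_text)
```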