Scripts
ms_crawler.py
A script for scraping all messages, comments, and reactions from a Microsoft Teams channel and storing this data in JSON format.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/crawlers/ms_crawler.py.
Configuration
The configuration file for ms_crawler.py should be stored under ./config/config.cfg with the following structure:
[azure]
clientId =
tenantId =
graphUserScopes = User.Read Team.ReadBasic.All
[teams]
teamId =
channelId =
outputFile = output/message_archive.json
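As a rough sketch of how a file with this structure can be read, the snippet below parses it with Python's standard configparser module. The function name load_teams_config and all sample values are hypothetical; ms_crawler.py may load its configuration differently.

```python
import configparser

# Made-up sample values in the structure shown above.
SAMPLE_CONFIG = """
[azure]
clientId = 00000000-0000-0000-0000-000000000000
tenantId = 11111111-1111-1111-1111-111111111111
graphUserScopes = User.Read Team.ReadBasic.All

[teams]
teamId = example-team-id
channelId = example-channel-id
outputFile = output/message_archive.json
"""


def load_teams_config(text):
    """Parse the config text and return the values a crawler would need."""
    parser = configparser.ConfigParser()
    # configparser lower-cases option names by default; keep the camelCase keys.
    parser.optionxform = str
    parser.read_string(text)
    return {
        "client_id": parser["azure"]["clientId"],
        "tenant_id": parser["azure"]["tenantId"],
        # Scopes are space-separated in the file, so split them into a list.
        "scopes": parser["azure"]["graphUserScopes"].split(),
        "team_id": parser["teams"]["teamId"],
        "channel_id": parser["teams"]["channelId"],
        "output_file": parser["teams"]["outputFile"],
    }


settings = load_teams_config(SAMPLE_CONFIG)
print(settings["scopes"])
```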
Command-line arguments
% PYTHONPATH=. python3 scripts/crawlers/ms_crawler.py -h
usage: ms_crawler.py [-h] -c CONFIG
Crawls Microsoft Graph API for Teams messages.
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Configuration file
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/crawlers/ms_crawler.py -c ./config/config.cfg
You will receive the following prompt:
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code ABCD12345 to authenticate.
Enter the URL in a browser, submit the code, and authenticate to Microsoft as you would normally. Once this is complete, the script should begin downloading data.
Data structure
{
"message": {
"attachments": [
{
"id": "Unique identifier for attachment",
"content": "Attachment body text",
"content_type": "Attachment data type"
}
],
"content_type": "Teams message data type",
"content": "Teams message body text",
"timestamp": "Message creation date and time in TZ format",
"user": "Message author",
"reactions": [
{
"reaction_name": "Name of emoji reaction",
"reaction_type": "Reaction emoji",
"timestamp": "Timestamp of reaction"
}
],
"replies": [
{
"content": "Reply body text",
"content_type": "Reply data type",
"timestamp": "Timestamp of reply",
"user": "Reply author",
"reactions": [
{
"reaction_name": "Name of emoji reaction",
"reaction_type": "Reaction emoji",
"timestamp": "Timestamp of reaction"
}
]
}
]
}
}
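To illustrate how this structure can be traversed, the sketch below counts the attachments, replies, and reactions in one record. It assumes a single record shaped exactly as above; the sample values and the function name summarize_record are made up for illustration, and the layout of the full output file (for example, whether records are stored in a list) is not shown here.

```python
# A made-up record following the structure documented above.
SAMPLE_RECORD = {
    "message": {
        "attachments": [],
        "content_type": "html",
        "content": "<p>Disk usage is high</p>",
        "timestamp": "2024-01-01T12:00:00Z",
        "user": "Example User",
        "reactions": [
            {
                "reaction_name": "like",
                "reaction_type": "like",
                "timestamp": "2024-01-01T12:05:00Z",
            }
        ],
        "replies": [
            {
                "content": "Looking into it",
                "content_type": "text",
                "timestamp": "2024-01-01T12:10:00Z",
                "user": "Another User",
                "reactions": [],
            }
        ],
    }
}


def summarize_record(record):
    """Count the nested items of one message record."""
    message = record["message"]
    replies = message.get("replies", [])
    # Reactions can be attached to the message itself and to each reply.
    reaction_count = len(message.get("reactions", []))
    reaction_count += sum(len(reply.get("reactions", [])) for reply in replies)
    return {
        "author": message["user"],
        "attachments": len(message.get("attachments", [])),
        "replies": len(replies),
        "reactions": reaction_count,
    }


print(summarize_record(SAMPLE_RECORD))
```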
find_sn_ticket_numbers.py
A script for pulling a list of ServiceNow ticket numbers from any text file.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/crawlers/find_sn_ticket_numbers.py.
Command-line arguments
% python3 find_sn_ticket_numbers.py -h
usage: find_sn_ticket_numbers.py [-h] -i INPUT -o OUTPUT
Find ServiceNow ticket numbers in text file.
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Text file to search for ServiceNow ticket numbers
-o OUTPUT, --output OUTPUT
Output file to write the found ticket numbers
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/crawlers/find_sn_ticket_numbers.py -i ./some_text.json -o ticket_numbers.txt
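The kind of extraction this script performs can be sketched with a regular expression. The pattern below is an assumption, not the one used by find_sn_ticket_numbers.py: it treats a ticket number as one of a few common ServiceNow prefixes followed by seven digits. The names TICKET_PATTERN and find_ticket_numbers are hypothetical.

```python
import re

# Assumed format: a known prefix followed by exactly seven digits.
# Real instances may use other prefixes or lengths.
TICKET_PATTERN = re.compile(r"\b(?:INC|RITM|REQ|CHG)\d{7}\b")


def find_ticket_numbers(text):
    """Return the unique ticket numbers in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(TICKET_PATTERN.findall(text)))


sample = 'See INC0012345 and RITM0054321; INC0012345 again. "INC123" is too short.'
print(find_ticket_numbers(sample))
```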
sn_crawler.py
A script for pulling the contents of ServiceNow tickets from the SN API and storing the data in JSON format.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/crawlers/sn_crawler.py.
Configuration
The script requires a configuration file containing ServiceNow authentication parameters in the following format:
[service_now]
url=
username=
password=
- url: full base URL of ServiceNow instance
- username: Username of user with permissions to the ServiceNow API
- password: Password for user with permissions to the ServiceNow API
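As a sketch of how these three parameters could be turned into an API request, the snippet below builds a ServiceNow Table API URL and a Basic authentication header without sending anything. That sn_crawler.py uses the Table API and the incident table is an assumption; the function name build_ticket_request and the sample values are hypothetical.

```python
import base64
from urllib.parse import urlencode


def build_ticket_request(url, username, password, ticket_number, table="incident"):
    """Return (endpoint, headers) for looking up one ticket by number."""
    # sysparm_query filters rows; sysparm_limit caps the number of results.
    query = urlencode({"sysparm_query": f"number={ticket_number}", "sysparm_limit": 1})
    endpoint = f"{url.rstrip('/')}/api/now/table/{table}?{query}"
    # HTTP Basic authentication: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}
    return endpoint, headers


endpoint, headers = build_ticket_request(
    "https://example.service-now.com/", "api_user", "secret", "INC0012345"
)
print(endpoint)
```

The returned pair can be passed to any HTTP client; no request is made by the sketch itself.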
Command-line arguments
% python3 sn_crawler.py -h
usage: sn_crawler.py [-h] -c CONFIG -i INPUT -o OUTPUT
SN Crawler
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
ServiceNow API configuration file
-i INPUT, --input INPUT
Input file of ServiceNow tickets to retrieve
-o OUTPUT, --output OUTPUT
Output file to write ServiceNow ticket data to
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/crawlers/sn_crawler.py -c ./config/config.cfg -i ./ticket_numbers.txt -o ticket_output.json
web_crawler.py
A script for crawling a website and copying all HTML and binary files to disk.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/crawlers/web_crawler.py.
Initialization
After installing Playwright via pip, run the following command to install browsers:
playwright install
Configuration
The configuration file for web_crawler.py should be stored under ./config/config.cfg with the following structure:
[crawler]
domain =
auth_url =
auth_verification_url =
state_path = ./state.json
output_dir = ./archive
Descriptions of the configuration options are as follows:
- domain: The base domain name of the site you wish to crawl
- auth_url: The URL to authenticate against prior to scanning
- auth_verification_url: The URL that confirms authentication was successful
- state_path: The path for the Playwright state file
- output_dir: The path of the directory to write the HTML archive to
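To show what restricting a crawl to the configured domain can look like, the sketch below resolves links found on a page and keeps only those inside the domain. This is an illustration under assumptions, not the logic of web_crawler.py: the function names in_scope and resolve_links are hypothetical, and subdomains are treated as in scope.

```python
from urllib.parse import urljoin, urlparse


def in_scope(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def resolve_links(page_url, hrefs, domain):
    """Resolve hrefs against the page URL and keep in-scope http(s) links."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and in_scope(absolute, domain):
            # Drop the fragment so the same page is not queued twice.
            resolved.append(parsed._replace(fragment="").geturl())
    return resolved


links = resolve_links(
    "https://wiki.example.com/a/index.html",
    ["page.html", "/b#top", "https://other.org/x", "mailto:me@example.com"],
    "example.com",
)
print(links)
```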
Command-line arguments
% PYTHONPATH=. python3 scripts/crawlers/web_crawler.py -h
usage: web_crawler.py [-h] -c CONFIG
Crawls Microsoft Graph API for Teams messages.
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Configuration file
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/crawlers/web_crawler.py -c ./config/config.cfg
html_flattener.py
A script for flattening the output of web_crawler.py for embedding.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/data_flattening/html_flattener.py.
Command-line arguments
% python3 html_flattener.py -h
usage: html_flattener.py [-h] -d IN_DIRECTORY -o OUTPUT
HTML Archive Flattener
optional arguments:
-h, --help show this help message and exit
-d IN_DIRECTORY, --in-directory IN_DIRECTORY
Directory of HTML files to flatten
-o OUTPUT, --output OUTPUT
Output filename
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/data_flattening/html_flattener.py -d ./html_archive -o flattened_html.md
sn_flattener.py
A script for flattening the output of sn_crawler.py for embedding.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/data_flattening/sn_flattener.py.
Configuration
The script requires one template file compatible with string.Template().substitute().
ticket.template: Output format for ServiceNow tickets
Example:
# ServiceNow Ticket $number
* Created By: $sys_created_by
* Created On: $sys_created_on
* Opened By: $opened_by
* Opened At: $opened_at
* Priority: $priority
* Urgency: $urgency
* Impact: $impact
* Service Offering: $service_offering
* Service Provider: $u_service_provider
* IT Service: $u_it_service
* Application: $u_application
* Assigned To: $assigned_to
* Assignment Group: $assignment_group
* Closed At: $closed_at
## Ticket $number Short Description
$short_description
## Ticket $number Description
$description
## Ticket $number Work Notes
$close_notes
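A minimal sketch of how such a template is filled with string.Template().substitute() follows. It uses a shortened version of the example template and a made-up ticket; the function name render_ticket is hypothetical, and sn_flattener.py may prepare its field values differently.

```python
from string import Template

# Shortened version of the example template above.
TICKET_TEMPLATE = Template(
    "# ServiceNow Ticket $number\n"
    "* Priority: $priority\n"
    "## Ticket $number Short Description\n"
    "$short_description\n"
)


def render_ticket(ticket):
    """Fill the template from a dict of ticket fields."""
    # substitute() raises KeyError when a placeholder has no value;
    # keys in the dict that the template does not use are ignored.
    return TICKET_TEMPLATE.substitute(ticket)


ticket = {
    "number": "INC0012345",
    "priority": "3",
    "short_description": "Disk full",
    "urgency": "2",  # not used by the shortened template
}
print(render_ticket(ticket))
```

Because substitute() fails on a missing field, every placeholder in ticket.template needs a matching key in the ticket data; string.Template also offers safe_substitute(), which leaves unknown placeholders in place instead.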
Command-line arguments
% python3 sn_flattener.py -h
usage: sn_flattener.py [-h] -i INPUT -o OUTPUT -t TEMPLATE_DIR
ServiceNow Ticket Flattener
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
JSON ServiceNow Ticket file from sn_crawler.py output
-o OUTPUT, --output OUTPUT
Output filename
-t TEMPLATE_DIR, --template-dir TEMPLATE_DIR
Directory containing output template files
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/data_flattening/sn_flattener.py -i sn_tickets.json -o sn_output.md -t ./templates
teams_flattener.py
A script for flattening the output of ms_crawler.py for embedding.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/data_flattening/teams_flattener.py.
Configuration
The script requires three template files compatible with string.Template().substitute().
attachment.template: Output format for message attachments
reply.template: Output format for message replies
message.template: Output format for flattened Teams messages
Example:
# Teams Message
* Subject: $subject
* Timestamp: $timestamp
* Sender: $sender
## Content
$content
$attachments
## Replies
$replies
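The three templates nest: replies are rendered first and the result is substituted into the message template. The sketch below shows that composition with a shortened message template and a made-up one-line reply template; the reply template text, the function name flatten_message, and the mapping of the crawler's "user" field to $sender are assumptions.

```python
from string import Template

# Shortened message template based on the example above.
MESSAGE_TEMPLATE = Template(
    "# Teams Message\n"
    "* Timestamp: $timestamp\n"
    "* Sender: $sender\n"
    "## Content\n"
    "$content\n"
    "## Replies\n"
    "$replies\n"
)
# Hypothetical reply template; the real reply.template is not shown here.
REPLY_TEMPLATE = Template("* $sender ($timestamp): $content")


def flatten_message(message):
    """Render replies first, then place them into the message template."""
    replies = "\n".join(
        REPLY_TEMPLATE.substitute(
            sender=reply["user"],
            timestamp=reply["timestamp"],
            content=reply["content"],
        )
        for reply in message.get("replies", [])
    )
    return MESSAGE_TEMPLATE.substitute(
        timestamp=message["timestamp"],
        sender=message["user"],
        content=message["content"],
        replies=replies,
    )


message = {
    "timestamp": "T1",
    "user": "Alice",
    "content": "Hello",
    "replies": [{"user": "Bob", "timestamp": "T2", "content": "Hi"}],
}
print(flatten_message(message))
```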
Command-line arguments
% python teams_flattener.py -h
usage: teams_flattener.py [-h] -i INPUT -o OUTPUT -t TEMPLATE_DIR
Microsoft Teams Message Flattener
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
JSON Teams Message file from ms_crawler.py output
-o OUTPUT, --output OUTPUT
Output filename
-t TEMPLATE_DIR, --template-dir TEMPLATE_DIR
Directory containing output template files
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/data_flattening/teams_flattener.py -i ./message_archive_full.json -o ./flattened_output.md -t ./templates
store_embeddings.py
A script for storing text embeddings in a vector database.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/embeddings/store_embeddings.py.
Configuration
The script requires a configuration file containing several parameters in the following format:
[vector_db]
host=
port=
collection=
[model]
model_name=
model_size=
[splitter]
chunk_size=
chunk_overlap=
- vector_db
- host: hostname of vector database (Required)
- port: port of vector database (Required)
- collection: collection name for data in vector database (Required)
- model
- model_name: name of the model used for generating embeddings (Required)
- model_size: size of the model (Required)
- splitter
- chunk_size: number of bytes to divide text into for chunking (Optional)
- chunk_overlap: number of bytes to overlap chunks (Optional)
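How chunk_size and chunk_overlap interact can be sketched with a simple fixed-size splitter. This is an illustration only: the splitter used by store_embeddings.py is not shown here, the function name split_text is hypothetical, and the sketch counts characters rather than bytes for simplicity.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into chunks of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With a chunk size of 4 and an overlap of 1, each chunk repeats the last character of the previous chunk, so text cut at a chunk boundary still appears with some context on both sides.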
Command-line arguments
% PYTHONPATH=. python3 scripts/embeddings/store_embeddings.py -h
usage: store_embeddings.py [-h] -i INPUT -c CONFIG [--json]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Path to the Teams JSON file to store in vector database
-c CONFIG, --config CONFIG
Path to the config file
--json Handle input as JSON with metadata.
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/embeddings/store_embeddings.py -c config/config.cfg -i input_text.json --json
query_vector_db.py
A script for querying a vector database collection for similar text.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/embeddings/query_vector_db.py.
Configuration
The script requires a configuration file containing several parameters in the following format:
[vector_db]
host=
port=
collection=
[model]
model_name=
model_size=
- vector_db
- host: hostname of vector database (Required)
- port: port of vector database (Required)
- collection: collection name for data in vector database (Required)
- model
- model_name: name of the model used for generating embeddings (Required)
- model_size: size of the model (Required)
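Conceptually, a similarity query embeds the prompt with the configured model and returns the stored texts whose vectors are closest to it. The sketch below illustrates that ranking step with cosine similarity on tiny made-up vectors. It is not how query_vector_db.py or any particular vector database implements search; the function names and the choice of cosine similarity are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, stored, k=2):
    """Return the k stored texts most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up two-dimensional "embeddings"; real embeddings have many more dimensions.
stored = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.1], stored, k=2))
```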
Command-line arguments
% PYTHONPATH=. python3 scripts/embeddings/query_vector_db.py -h
usage: query_vector_db.py [-h] -c CONFIG [prompt]
positional arguments:
prompt Query text
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Path to the config file
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/embeddings/query_vector_db.py -c config/config.cfg "What is information security?"
populate_graph_db.py
A script for populating the graph database with Microsoft Teams alerts and ServiceNow tickets.
This script can be found at https://github.com/amerck/oms_practicum/blob/main/scripts/graph_db/populate_graph_db.py.
Configuration
The configuration file for populate_graph_db.py should use the following structure:
[graph_db]
uri=
username=
password=
Command-line arguments
% PYTHONPATH=. python3 scripts/graph_db/populate_graph_db.py -h
usage: populate_graph_db.py [-h] -c CONFIG --teams TEAMS --sn SN
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Path to the config file
--teams TEAMS Path to Teams Alert file
--sn SN Path to ServiceNow Ticket file
Running
In order to run the script, execute the following command:
PYTHONPATH=. python3 scripts/graph_db/populate_graph_db.py -c configs/graph.cfg --teams teams_flattened.json --sn sn_flattened.json