Skip to content

remove environment-specific manifests from public branch; #87

Open
mkuznet1 wants to merge 12 commits intoROCm:aicomnet_devfrom
mkuznet1:aicomnet_dev
Open

remove environment-specific manifests from public branch; #87
mkuznet1 wants to merge 12 commits intoROCm:aicomnet_devfrom
mkuznet1:aicomnet_dev

Conversation

@mkuznet1
Copy link
Copy Markdown

The .gitignore file has been restored, and the manifest files have been deleted

coketaste and others added 8 commits December 2, 2025 11:49
* Added therock model for testing TheRock image

* Added therock model

* Modified the Dockerfile of TheRock only install core runtime and hip runtime

* Fixe the generate-sys-env-details arg in mad

* Redsign the rocEnvTool to work with TheRock based image

* Keep compatible to the csv parser

* Fixed the csv parser

* Updated README of rocEnvTool accordingly
* Implemented a module to parse config inputs and creat perf_entry_super.json and upload dataset to MongoDB

* Implement update perf superset

* fix unit tests of super set

* Fixed the perf superset data collection and MongoDB update
Resolve merge conflicts by keeping refactor-dis (v2) and discarding
main (v1) changes:

- Remove src/madengine/mad.py and src/madengine/tools/run_models.py
  (deleted in v2, accept deletion over main's modifications)
- Resolve rocenv_tool.py conflict: keep current-branch version for
  unknown GPU device handling
- Resolve tests/fixtures/dummy/models.json: keep v2 fixture set
  (dummy_superset and full model list) over main's therock-only entry
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR appears to clean up local/runtime artifacts by removing previously committed run-manifest JSON files and an environment file, and updating .gitignore to ignore additional generated content.

Changes:

  • Removed two committed run_manifest_*.json files under manifests/.
  • Removed manifests/mad.env (shell environment exports).
  • Updated .gitignore (adds *.json, and fixes formatting for .madengine_session_start).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

File Description
manifests/run_manifest_pyt_vllm_dissag_llama-3.1-8b_3node_rdma_localimage.json Removed a committed run-manifest JSON (likely local/generated).
manifests/run_manifest_primus_2node_qwen_localimage.json Removed a committed run-manifest JSON (likely local/generated).
manifests/mad.env Removed a committed environment export file (likely local setup).
.gitignore Ignores additional files (notably *.json) and normalizes an entry’s formatting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

- Introduced per-node artifact staging to a dedicated results directory.
- Implemented a mechanism to wait for all nodes to complete staging before merging results.
- Added logic to merge performance CSV files from multiple nodes, selecting the best file based on content.
- Updated the master node's result collection process to reflect these changes, ensuring comprehensive data aggregation.

This update aims to improve the reliability and accuracy of performance reporting in distributed SLURM runs.
- Removed redundant file-based synchronization mechanism for node readiness.
- Simplified the barrier waiting process by directly utilizing TCP for image readiness.
- Adjusted timeout handling to ensure consistent behavior across node synchronization.

This change enhances the efficiency of multi-node operations by streamlining the readiness check process.
- Added logic to prefer files with the most non-empty performance values during result aggregation.
- Implemented dynamic column index retrieval for the "performance" column in CSV files, ensuring accurate counting of non-empty performance entries.
- Maintained backward compatibility by falling back to the previous method if the performance column is not found.

This update aims to enhance the accuracy of performance metrics in multi-node training scenarios.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants