Onesmus Dzidzai Maenzanise
Cloud Infrastructure & Software Engineer
"I build and maintain production systems that cannot afford to fail."
Production Systems Overview
I operate in live production environments, not isolated development sandboxes. My work centers on maintaining reliability, observability, and fault handling across distributed systems that process hundreds of thousands of transactions daily.
I specialize in stabilizing real-world systems under load, managing cross-organization integrations, and ensuring that every component, from API gateways to database clusters, performs predictably at scale.
My approach combines disciplined software engineering with infrastructure operations, treating production environments as the single source of truth.
Production Stack
Tools I operate daily in production environments. Every component is battle-tested under real traffic and real incidents.
Container orchestration across EKS clusters. Pod management, rollouts, resource scaling, and health checks in production.
GitOps delivery for all production deployments. Declarative sync policies, automated rollbacks, and application health monitoring.
Unified observability dashboards for system health, application performance, and infrastructure metrics across all environments.
Metrics collection and alerting for distributed systems. Custom alert rules, time-series analysis, and incident trigger pipelines.
Serverless compute for event-driven workloads. Integrated with API Gateway, SQS, and S3 for scalable production processing.
Infrastructure as Code for provisioning and managing cloud resources. Modular configurations, state management, and pipeline-driven deployments.
Capabilities
Each capability is a production-tested system module, battle-hardened across fintech and telecom environments handling real traffic and real failures.
- AWS Lambda - serverless compute
- Docker / Kubernetes - container orchestration
- Cloud hosting - multi-environment deployments
- CI/CD pipelines - automated build & deploy
- Deployment automation - zero-downtime releases
- Infrastructure troubleshooting - root cause analysis
- Golang / Python / Node.js / Java - polyglot engineering
- REST / GraphQL APIs - service interfaces
- Integration systems - cross-platform connectivity
- PostgreSQL / MySQL / MongoDB - relational & document stores
- Performance tuning - query optimization, indexing
- Data pipelines - ETL, streaming, batch processing
- PCI DSS - compliant payment environments
- OAuth2 / JWT - authentication & authorization
- Secure system design - defense in depth
- Incident response - real-time production triage
- Root Cause Analysis - systematic failure investigation
- Monitoring dashboards - metrics, logs, traces
- Kubernetes cluster deployments and container orchestration in production
- Deployment pipelines, rollback strategies, and uptime management
- Pod scaling, resource management, and health checks
- GitOps-based deployment workflows using ArgoCD
- Declarative application delivery to production infrastructure
- Sync policies, rollback management, and deployment observability
- Grafana dashboards for system health and performance monitoring
- Metrics visualization across distributed services
- Alert configuration and incident response support
Experience
Intermediate Software Engineer
APS Holdings
- Supported and maintained cloud-based payment and integration platforms at production scale
- Second-line incident response: investigating failed transactions, degraded integrations, and performance issues
- Led integrations between internal platforms and external partner systems
- Built and maintained Spring Boot microservices for partner-facing API integrations
- Took technical ownership of cross-organization integration workstreams involving telecom partners
- Worked within PCI DSS requirements, managing authentication, access control, and audit needs
- Contributed to CI/CD pipelines and containerized deployments
Junior Software Engineer
APS Holdings
- Developed and supported Golang-based backend services in live production environments
- Assisted with production incident investigations and post-incident fixes
- Worked with Kubernetes-based deployment pipelines, improving deployment speed and rollback reliability
- Supported operational dashboards and reporting used by finance and operations teams
Software Engineer
CartShare (Remote)
- Developed backend services and integrations for e-commerce payment and data systems
- Built and supported AWS Lambda-based workflows and analytics pipelines
- Implemented automated deployment workflows to reduce time-to-market and deployment errors
- Supported operational issues related to data consistency and transaction processing
Software Engineer in Training
WeThinkCode_
- Built full-stack applications using Python, Java, JavaScript, and SQL
- Worked in agile teams delivering production-style projects with testing and documentation expectations
- Achieved high test coverage and developed disciplined debugging and review practices
How I Think About Systems
Systems must fail safely, not silently.
A failure you know about is a failure you can respond to. Silent failures cascade into outages. Every component should degrade gracefully and report its state.
Observability is not optional.
You can’t fix what you can’t observe. Maintaining a production system requires metrics, logs, and traces, and so does judging whether a system is ready for production in the first place. If something is not observable, it is not yet ready for production.
Complexity must be controlled, not eliminated blindly.
The goal of a distributed system is managed complexity. A distributed system will never have zero complexity, but it should have clear boundaries, explicitly defined interfaces, and deliberate trade-offs.
Production is the only real environment.
Development and staging are approximations of production. Production is where your system meets real load, real data, and real failures, and where its actual behavior shows. Design your system with production in mind.
Reliability is a feature, not an afterthought.
Every architectural decision has implications for reliability. Redundancy, retry strategies, circuit breakers, and back pressure are not optional add-ons; they are core design requirements.
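As a minimal sketch of two of these patterns, here is a retry with exponential backoff combined with a circuit breaker. It assumes Python, and every name in it (`CircuitBreaker`, `CircuitOpenError`, `retry_with_backoff`, the thresholds and delays) is hypothetical and illustrative, not taken from any system described on this page.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast and loudly instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure, do not swallow it
            sleep(base_delay * 2 ** attempt)
```

A caller might wrap a dependency call as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a placeholder for any function that can fail. The retry absorbs transient errors, and the breaker stops retries from hammering a dependency that is clearly down.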
Education
B.S. Software Development
Expected 2026
BYU-Idaho
Bachelor's degree program in Software Development, focused on software engineering principles, data structures, and system design.
National Certificate: Information Technology (Systems Development) NQF 5
Aug 2022 – Jan 2024
WeThinkCode_
National certificate in systems development covering programming, system analysis, and software development methodologies. Completed through an intensive peer-led engineering program with emphasis on testing, agile delivery, debugging, and real-world software development practices.
Contact
Available for infrastructure-focused roles and production engineering opportunities.