Senior Software Engineer, Reliability Engineering (US)
Company: Onehouse
Location: Sunnyvale
Posted on: October 3, 2024
Job Description:
About Onehouse
Onehouse is a mission-driven company dedicated to
freeing data from data platform lock-in. We deliver the industry's
most interoperable data lakehouse through a cloud-native managed
service built on Apache Hudi. Onehouse enables organizations to
ingest data at scale with minute-level freshness, centrally store
it, and make it available to any downstream query engine and use case
(from traditional analytics to real-time AI / ML). We are a team of
self-driven, inspired, and seasoned builders who have created
large-scale data systems and globally distributed platforms that
sit at the heart of some of the largest enterprises out there,
including Uber, Snowflake, AWS, LinkedIn, Confluent and many more.
Riding on $33M in total funding and a fresh Series A backed by
Greylock/Addition, we are quickly expanding and looking for rising
talent to grow with us and become future leaders of the team. Come
help us build the world's best fully managed and self-optimizing
data lake platform!
The Community You Will Join
When you join
Onehouse, you're joining a team of passionate professionals
tackling the deeply technical challenges of building a 2-sided
engineering product. Our engineering team serves as the bridge
between the worlds of open source and enterprise: contributing
directly to and growing Apache Hudi (already used at scale by
global enterprises like Uber, Amazon, ByteDance, etc.) and
concurrently defining a new industry category - the transactional
data lake. The Reliability Engineering team is the glue that binds
all of this together. You will be responsible for developing and
maintaining the tools and systems that enable our engineering teams
to operate our services reliably and at scale. You will partner
closely and cross-functionally with our engineering teams to ensure
our services are able to scale with our growing business.
The Impact You Will Drive:
- At Onehouse, you will own our entire live production
infrastructure and operational posture to run massive data systems
at scale.
- Ensure our services remain resilient by identifying
opportunities for improvement and driving their implementation.
- Identify opportunities to improve our overall operational
efficiency and growth by owning the modern tools in our cloud-only
operation and our practices for proactive automation, monitoring
and response.
- Act as a mentor to guide cross-functional teams during
crisis situations and ensure timely resolution, minimizing the
impact on our customers and business.
A Typical Day:
- Build and own our reliability engineering practice from the
ground up, owning our entire production infrastructure and
operational posture.
- Establish a culture of reliability across engineering by
providing a comprehensive incident management platform that is
used for instrumentation, operability, and incident handling.
- Design, implement and maintain new services, tools, and
monitoring to support service reliability and alerting.
- Serve as an active member of our SRE team, responding to and
managing high severity incidents or any situations concerning the
wellbeing and continuous operation of our mission-critical
systems.
- Collaborate with your stakeholders across engineering teams to
ensure continuous adoption of best practices, rollout scenarios for
the space, and that services are designed with reliability in
mind.
- Continuously analyze and evaluate the tradeoffs of the existing
designs and make recommendations based on new technologies and
industry best practices.
- Support services before they go live through activities such as
system design consulting, developing software platforms and
frameworks, capacity management and launch reviews.
- Maintain services once they are live by measuring and
monitoring availability, latency and overall system health through
an intimate understanding of how the critical parts of our site
work.
- Contribute to better incident management posture and
retrospectives, driving improvements in our overall reliability and
incident response time as well as on-call runbooks and post-mortem
reports.
- Drive our compliance posture, ensuring that all our products
and processes comply with relevant regulations and standards,
especially during compliance audits.
What You Bring to the Table:
- Bachelor's degree in Computer Science or related field.
- 7+ years of experience in software engineering or SRE roles,
with a focus on large scale distributed systems.
- Strong coding skills in at least one programming language, such
as Java, Python, or Go.
- Strong conviction in software development best practices,
including version control, automated testing, and continuous
integration and delivery.
- Excellent problem-solving, triaging, and debugging skills in
large-scale distributed systems.
- Experience managing Kubernetes clusters and applications
at scale.
- Experience deploying applications on one or more cloud
platforms such as AWS, Google Cloud Platform or Microsoft
Azure.
- Experience defining and owning reliability-focused systems and
processes (e.g. Incident Management, Post-mortem).
- Experience with software development related compliance
processes (e.g. SOC 2, FedRAMP).
- Experience with the following tech stack:
  - Infrastructure-as-code (e.g. Terraform, CloudFormation)
  - Automation frameworks (e.g. Jenkins, CircleCI)
  - Monitoring stacks (e.g. Prometheus and ELK)
  - Cloud security management (e.g. IAM, SSO)
  - Data processing technologies like Spark
How We'll Take Care of You:
- Competitive Compensation; the estimated base salary range for
this role is $150,000 - $220,000
- Equity Compensation; our success is your success with eligible
participation in our company equity plan
- Health & Well-being; we'll invest in your physical and mental
well-being with up to 90% health coverage (50% for
spouses/dependents) including comprehensive medical, dental &
vision benefits
- Financial Future; we'll invest in your financial well-being by
making this role eligible to contribute to our company 401(k) or
Roth 401(k) retirement plan
- Location; we are a remote-friendly company (internationally
distributed across N. America + India), though some roles will be
subject to in-person requirements in alignment with the needs of
the business
- Generous Time Off; unlimited PTO (mandatory 1 week/year minimum),
uncapped sick days and 11 paid company holidays
- Company Camaraderie; annual company offsites and quarterly team
onsites at our Sunnyvale HQ
- Food & Meal Allowance; weekly lunch stipend, in-office
snacks/drinks
- Equipment; we'll provide you with the equipment you need to be
successful and a one-time $500 stipend for your initial desk setup
- Child Bonding!; 8 weeks off for parents (birthing, non-birthing,
adoptive, foster, child placement, new guardianship) - fully paid
so you can focus your energy on your newest addition
House Values
One Team
Optimize for the company, your team, self - in that order. We may
fight long and hard in the trenches, but we take care of our
co-workers with empathy. We give more than we take to build the one
house that everyone dreams of being part of.
Tough & Persevering
We are building our company in a very large, fast-growing but
highly competitive space. Life will get tough sometimes. We take
hardships in stride, stay positive, focus all energy on the path
forward and develop a champion's mindset to overcome the odds.
Always day one!
Keep Making It Better Always
Rome was not built in a day. If we can get 1% better each day for
one year, we'll end up thirty-seven times better. This means being
organized, communicating promptly, taking even small tasks
seriously, tracking all small ideas, and paying it forward.
Think Big, Act Fast
We have tremendous scope for innovation, but we will still be
judged by impact over time. Big, bold ideas still need to be
strategized against priorities, broken down, set in rapid motion,
measured, refined, and repeated. Great execution is what separates
promising companies from proven unicorns.
Be Customer Obsessed
Everyone has the responsibility to drive towards the best
experience for the customer, be it an OSS user or a paid customer.
If something is broken, own it, say something, do something; never
ignore it. Be the change that you want to see in the company.
Pay Range Transparency
Onehouse is committed to fair and equitable
compensation practices. Our job titles may span more than one
career level. The pay range for this role is listed above and
represents the base salary range for non-commissionable roles or
on-target earnings for commissionable roles. Actual compensation
packages are dependent upon several factors that are unique to each
candidate, including but not limited to: job-related skills, depth
of transferable experience, relevant certifications and training,
business needs, market demands and specific work location. Based on
the factors above, Onehouse utilizes the full width of the range;
the base pay range is subject to change and may be modified in the
future. The total compensation package for this position will also
include eligibility for equity options and the benefits listed
above.