High Level Considerations
Hardware: Ways to build a vSAN cluster
- VxRail: HCI appliance
- vSAN Ready Nodes: certified hardware form factor (Recommended)
- Build your own based on certified components: follow the VMware Compatibility Guide
- IO Controller
- Driver version
- Firmware version
- Run the latest version of ESXi and vCenter. (VMware continuously fixes issues encountered by customers)
- Identical configuration across all cluster members.
- Avoid unbalanced configurations (e.g. some hosts not contributing storage to the vsanDatastore, or different IO controllers and disks); they complicate support if a problem is encountered
Design for growth
- Supports both scale-up and scale-out: scale so that there is an adequate amount of capacity and cache for workloads
- Leave additional disk slots free
- Oversize cache devices up front
Sizing: Capacity, Maintenance and Availability
If there is a failure (device or host), or a host enters maintenance mode, vSAN attempts to rebuild the components from the failed device/host on the remaining hosts in the cluster.
- 2-node cluster with witness appliance: rebuilding not possible; no spare fault domain available
- 3-node cluster: rebuilding not possible; no spare fault domain available
- 4-Node Cluster: Rebuilding is possible
Number of failures to Tolerate
- Consider space required for mirror copies
All Flash vs Hybrid
Cache size: 10% of anticipated consumed capacity (capacity before FTT is considered) is recommended for both all-flash and hybrid.
How Cache is used
- Hybrid: 70% for Reads, 30% for Writes
- All-flash: 100% for writes (reads are served from the flash capacity tier)
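The fixed hybrid split above can be expressed as a quick sizing helper (a minimal sketch; the function name and the 600 GB device size are illustrative, not from the guide):

```python
def hybrid_cache_split(cache_device_gb):
    """Split a hybrid cache device per vSAN's fixed 70/30 allocation."""
    read_cache_gb = cache_device_gb * 0.70    # 70% serves reads
    write_buffer_gb = cache_device_gb * 0.30  # 30% buffers writes
    return read_cache_gb, write_buffer_gb

# Example: a 600 GB cache device in a hybrid disk group
read_gb, write_gb = hybrid_cache_split(600)
print(round(read_gb), round(write_gb))  # 420 180
```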
All Flash Considerations
- 1 Gb networking is not supported; 10 Gb is required
- Flash read cache reservation is not used with all flash configurations
- Flash devices must be marked as capacity devices before vSAN can use them in the capacity tier
- Endurance becomes important consideration for both cache and capacity layers.
- 2 ESXi hosts with witness appliance / 3 ESXi Hosts: In event of a failure, VSAN cannot rebuild components on another host
- 4+ ESXi Hosts: Recommended. VSAN can rebuild components on another host
- Maximum: 64 hosts (to run 64 nodes, certain advanced settings must be set)
- Maximum VMs allowed: 200 VMs per ESXi host and 6,400 VMs per cluster (vSAN 6.0)
- Maximum VMs protected by vSphere HA: 6400 VMs (VSAN 6.0). Earlier versions had limit of 2048 ( per Datastore limit)
Disks and Disk groups
- Disk group: 1 cache device, 1-7 capacity devices
- Host: 1-5 disk groups
- Caution: VSAN does not support mixing of all flash disk groups and hybrid disk groups in a cluster.
- Each stripe of an object is a component
- Maximum components per host: 9,000
- Stretched cluster: 45,000 components
- Largest component size: 255 GB
- Stripe Width per Object: 1-12.
- NumberOfFailuresToTolerate: 1-3. To accommodate ‘n’ failures there needs to be ‘2n+1’ fault domains/hosts in the cluster.
- FlashReadCacheReservation: maximum 100%
- Not applicable on all-flash
- ObjectSpaceReservation: maximum 100%
- IOPLimitPerObject: maximum 2147483647
- Maximum VMDK size in vSAN 6.0: 62 TB (component size: 255 GB)
- Consider enabling vSphere HA
- Ensure there are enough devices in capacity layer to accommodate a desired stripe width requirement
- Ensure there are enough hosts / fault domains to support FTT
- Consider component count
- Keep in mind that VMDKs, even 62 TB VMDKs, are thinly provisioned by default, so be prepared for future growth in capacity
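The stripe, replica and component limits above can be combined into a rough per-VMDK component estimate. This is a simplified sketch: the function name and the single-witness approximation are mine; the real placement engine decides actual witness counts and splits.

```python
import math

MAX_COMPONENT_GB = 255  # largest vSAN component size

def estimate_components(vmdk_gb, ftt=1, stripe_width=1):
    """Rough component count for one VMDK object.

    Simplified model: each replica is split into `stripe_width` stripes,
    and any stripe over 255 GB is split further. Witness count is
    approximated as 1; vSAN may place more.
    """
    per_stripe_gb = vmdk_gb / stripe_width
    splits_per_stripe = max(1, math.ceil(per_stripe_gb / MAX_COMPONENT_GB))
    replicas = ftt + 1
    data_components = replicas * stripe_width * splits_per_stripe
    witnesses = 1  # approximation
    return data_components + witnesses

# 1 TB VMDK, FTT=1, stripe width 2
print(estimate_components(vmdk_gb=1024, ftt=1, stripe_width=2))  # 13
```

Useful for sanity-checking designs against the 9,000 components-per-host maximum.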
Network Design Considerations
Hybrid:
- 1 Gbps: must be dedicated to vSAN traffic
- 10 Gbps: if shared with other traffic types, use NIOC to ensure other traffic does not impact vSAN traffic
All-flash:
- 10 Gbps required: if shared with other traffic types, use NIOC to ensure other traffic does not impact vSAN traffic
Bandwidth factors:
- Replication and communication traffic between ESXi hosts
- Number of replicas per VM
- I/O-intensive applications running in VMs
- NIC teaming: Active/Standby recommended
- Jumbo frames: recommended if already enabled in the network infrastructure; otherwise not recommended (operational costs outweigh the limited CPU and performance benefits)
Multicast / Unicast – depends on VSAN version
- Prior to vSAN 6.6: multicast is required
- vSAN 6.6 and later: vSAN traffic uses unicast
- vSAN Network Design Guide
Storage Design Considerations
Client Cache
- Relevant on both all-flash and hybrid configurations
- Accelerates read performance
- Leverages DRAM local to the VM. The amount of RAM allocated is 0.4% of host memory, up to 1 GB per host
- Complementary to CBRC. CBRC is limited to the read-only replica; the Client Cache can cache the read-only replica and VMDKs as well
- CBRC allows common cached blocks to be served to virtual desktops in microseconds instead of milliseconds
Read Cache
- Relevant on hybrid configurations only
- vSAN divides the caching of data blocks evenly between the replica copies
- If the block being read from the first replica is not in cache, the directory service is referenced to find whether the block is in the cache of another host
- If found, the data is retrieved from there; if not found, it is a read miss and the block is fetched from magnetic disk
Write Cache
- Relevant on both all-flash and hybrid configurations
- When a write is initiated by the application running inside the guest OS, the write is duplicated to the write cache on the hosts that contain replica copies of the storage object
Flash Endurance Considerations
- Endurance specification to be used: Terabytes Written (TBW)
- For both cache and capacity devices
Flash Cache Sizing
General Recommendation: 10% of the expected consumed storage capacity (for all VMs) before NFTT is considered.
- Note: If VM size is 100GB and expected usage is 20GB then 20GB is the expected consumed storage for the VM.
- If VM snapshots are used heavily increase Cache: Capacity ratio to 15%
- The objective is to keep the Active Working Set in cache as much as possible for best performance.
- FlashReadCacheReservation policy setting is only relevant on hybrid clusters
- Design for growth: Consider designing with a larger cache configuration that will allow for seamless future capacity growth
All Flash Configurations
- Prior to 6.5: 10% of the expected consumed storage capacity
- From 6.5: Performance based. Link: https://blogs.vmware.com/virtualblocks/2017/01/18/designing-vsan-disk-groups-cache-ratio-revisited/
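The 10% rule (hybrid, and all-flash before 6.5) can be sketched as follows; the workload numbers are hypothetical:

```python
def flash_cache_gb(expected_consumed_gb, ratio=0.10):
    """Cache tier to provision: 10% of anticipated consumed capacity,
    measured before FTT replicas are counted. Raise ratio to 0.15 if
    snapshots are used heavily (per the notes above)."""
    return expected_consumed_gb * ratio

# 100 VMs, each expected to consume 20 GB of its provisioned VMDK
consumed = 100 * 20               # 2000 GB anticipated consumed capacity
print(flash_cache_gb(consumed))   # 200 GB of cache across the cluster
```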
Best practice: check the VCG and ensure that the flash devices provide the endurance characteristics required for the vSAN design.
Capacity Sizing Considerations
Common Considerations for both Hybrid & All Flash
- Number of VMs
- Number of snapshots taken concurrently and snapshot size
- Would snapshots capture VM Memory also? If yes, consider the space.
- Number of replica copies that will be created; NumberOfFailuresToTolerate
- Thin Provisioning Over Commitment ( Object Space Reservation)
Consideration Specific to All Flash
- Endurance and performance become considerations for the capacity layer in all-flash configurations
- NumberOfFailuresToTolerate: n failures tolerated means n+1 replicas are created
Formatting overhead: all disks in a disk group are formatted with the on-disk file system, which consumes some space
- V1: 750MB per disk
- V2: 1% of physical disk capacity
- V3: 1% of physical disk capacity + deduplication metadata.
Checksum Overhead; 5 Bytes for every 4KB data
- Without deduplication: 0.12% of raw capacity
- With deduplication: 1.2%
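As a quick check of the 0.12% figure, 5 bytes per 4 KB block works out as:

```python
# vSAN stores a 5-byte checksum for every 4 KB (4096-byte) block.
overhead = 5 / 4096
print(f"{overhead:.2%}")  # 0.12% of raw capacity (without deduplication)

# Applied to a hypothetical 10 TB raw capacity tier:
raw_tb = 10
print(raw_tb * overhead * 1024)  # 12.5 GB consumed by checksums
```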
Recommended free capacity (Slack Space): 30%. Design to Avoid Running out of Capacity
Capacity for failure: when a capacity or cache device fails, vSAN attempts to rebuild the missing/failed components on the remaining capacity in the cluster.
- Capacity device fails: components are rebuilt on the same disk group or a different disk group
- Cache device fails: the entire disk group is affected, and its components are rebuilt elsewhere in the cluster
- vSAN begins automatic rebalancing when a disk reaches the 80% full threshold
Negligible Capacity Overheads
Component Overhead: Every component created consumes space for metadata
- vSAN 5.5 (v1 on-disk format): 2 MB per component
- vSAN 6.0 (v2 on-disk format): 4 MB per component
- Witness Overhead: A witness is created for every component. Witness consumes 2MB of space (for metadata) on vSAN Datastore.
Scale Up Capacity
- Maintain the required cache:capacity ratio; provide a higher cache:capacity ratio initially
- New disk group: scales up both cache and capacity
- A disk group assigns one cache device to serve a group of capacity devices
- If the desired cache:capacity ratio is high, multiple disk groups must be created, because there can be only one cache device per disk group
Large disk group vs small disk groups
- Large disk groups -> lower cache:capacity ratio, lower cost
- Small disk groups -> higher cache:capacity ratio, higher cost
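A small helper makes the large-vs-small trade-off concrete (the 800 GB cache device and 4 TB capacity devices are hypothetical values):

```python
def cache_ratio(cache_gb, capacity_devices, device_gb):
    """Cache:capacity ratio of one disk group (one cache device max)."""
    return cache_gb / (capacity_devices * device_gb)

# Same 800 GB cache device, fronting more or fewer 4 TB capacity devices
large = cache_ratio(800, capacity_devices=7, device_gb=4000)  # one big group
small = cache_ratio(800, capacity_devices=3, device_gb=4000)  # smaller group
print(f"large group: {large:.1%}, small group: {small:.1%}")  # 2.9% vs 6.7%
```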
Designing Disk group
- Diskgroup =~ Storage failure domain
Large Disk Groups vs Small Disk Groups: Recommended Multiple Small Disk groups.
- Large disk groups: when there is a failure, the time to rebuild components is longer
- Small disk groups: require more flash devices, IO controllers and disk slots
Drive Capacity, Component Size and VMDK Size
- The maximum component size on vSAN is 255 GB
- A large VMDK object may be split into multiple components across multiple disks to accommodate its size. However, when vSAN splits an object in this way, multiple components may reside on the same physical disk, a configuration that is not allowed when NumberOfDiskStripesPerObject is specified in the policy
- Although vSAN might have the aggregate space available on the cluster to accommodate the large size VMDK object, it will depend on where this space is available. For example, in a 3 node cluster which has 200TB of free space, one could conceivably believe that this should accommodate a VMDK with 62TB that has a NumberOfFailuresToTolerate=1 (2 x 62TB = 124TB). However if one host has 100TB free, host two has 50TB free and host three has 50TB free, then this vSAN will not be able to accommodate this request.
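The 3-node example can be sketched as a feasibility check. This is a deliberate simplification: it assumes each replica must fit wholly within one host/fault domain, as in the example above, and ignores witness space.

```python
def can_place_mirrors(vmdk_tb, ftt, free_tb_per_host):
    """True if the ftt+1 replicas can each land on a *different* host.

    Aggregate free space alone is not enough: each mirror needs a host
    with room for the whole replica (simplified model).
    """
    hosts_with_room = sum(1 for free in free_tb_per_host if free >= vmdk_tb)
    return hosts_with_room >= ftt + 1

# 62 TB VMDK, FTT=1, on the 3-node cluster from the example:
print(can_place_mirrors(62, 1, [100, 50, 50]))  # False: only one host fits 62 TB
print(can_place_mirrors(62, 1, [100, 70, 50]))  # True
```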
PCIe Flash devices vs SSDs vs NVMe
Performance – Bandwidth
- SATA: 6 Gbps (most SSDs use the SATA interface)
- PCIe 3.x: 32 Gbps
Performance – IOPS: PCIe flash devices generally deliver far higher IOPS than SATA SSDs
Capacity
- SSD: largest 4000 GB
- PCIe flash: largest 6400 GB
- Consideration: does the workload require PCIe performance, or would an SSD be sufficient?
Factors to Consider
- Stripe Width
- SATA (for capacity-centric environments where performance is not a priority)
- SATA: up to 4 TB
- SAS: up to 1.2 TB
- SAS: 15K RPM; NL-SAS: 7200 RPM; SATA: 5400/7200 RPM
- Cache friendly workloads are less sensitive to disk performance than cache unfriendly workloads
- Good practice: Be Conservative. ( Application performance profiles may change over time. 10K RPM are generally accepted drives)
- Number of disks: having more, smaller magnetic disks will often give better performance than fewer, larger ones
- Uniform disk model across all nodes in cluster. Do not mix drive models/types.
Storage I/O Controller
- Ensure the components are in VCG
Single Controller vs Multiple Controller:
- # Of Disks per hosts & Ports supported by a Controller.
- Multiple IO controllers can reduce the failure domain. (blast radius)
- VMware has not extensively tested SAS expanders with vSAN, and thus does not encourage their use.
- SAS Expanders have been tested in limited cases with Ready Nodes on a case by case. Check VCG.
- Minimum recommended controller queue depth: 256
- Recommended: the largest queue depth possible
RAID 0 vs Pass-through: Recommended Pass-through.
- Pass-through means the controller presents the magnetic disks directly to the ESXi host
- RAID 0 means each magnetic disk must be configured as a RAID 0 volume before the ESXi host can see it
- Recommended: pass-through. From an operations perspective, RAID 0 drives typically take longer to install and replace than pass-through drives
- Disable Cache on Controller, if possible.
- Advanced controller features: recommended to disable acceleration features in a vSAN environment
VM Storage Policy Design Considerations
Storage Policy Design Decisions
Number of Disk Stripes per Object / Stripe Width
- Defines minimum number of capacity devices across which each replica of a storage object is distributed.
Does it improve VM performance? Yes and no; it depends on the application and the devices
- Yes: if VMs are I/O-sensitive and the capacity devices the VM data is distributed across are not busy
- No: if the capacity devices the VM data is distributed across are already busy
Stripe width sizing considerations
- Capacity devices: Are there enough devices in various hosts across the cluster to accommodate stripe width?
- Host Component Limit: Would Stripe width require significant number of components and impact/consume host component count?
Flash Read Cache Reservation(FRC) ( Relevant only in Hybrid Configurations)
- Recommendation: the default value is 0%; do not change it unless a specific performance issue is observed
- Flash Read Cache Reservation can easily exhaust the read cache, especially when thin provisioning is used (the reservation is based on the logical VMDK size)
Number of Failures to Tolerate (FTT)
- For “n” failures tolerated, “n+1” copies of the object are created and “2n+1” hosts contributing storage are required
- Default: 1.
- Maximum: 3 (if VMDK < 16 TB); 1 (if VMDK > 16 TB)
- Mirror copies consume space.
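The n+1 copies / 2n+1 hosts rule for mirroring can be captured in a small helper (sketch only; the function name is mine):

```python
def ftt_requirements(ftt, object_gb):
    """RAID-1 mirroring: tolerating n failures needs n+1 full copies
    and 2n+1 hosts (or fault domains) contributing storage."""
    copies = ftt + 1
    min_hosts = 2 * ftt + 1
    raw_gb = copies * object_gb  # capacity consumed before other overheads
    return copies, min_hosts, raw_gb

print(ftt_requirements(1, 100))  # (2, 3, 200): default policy
print(ftt_requirements(3, 100))  # (4, 7, 400): maximum FTT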
Fault Tolerance Method: RAID-1 (mirroring) or RAID-5/6 (erasure coding)
- Erasure coding provides significant capacity savings but incurs additional I/O overhead
- Erasure coding is available only in all-flash configurations
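To see why erasure coding saves capacity: vSAN implements FTT=1 as RAID-5 (3 data + 1 parity) and FTT=2 as RAID-6 (4 data + 2 parity), versus full mirrors for RAID-1. A sketch:

```python
def capacity_multiplier(method, ftt):
    """Raw:usable multiplier. RAID-1 mirrors (ftt+1 copies); erasure
    coding uses RAID-5 (3+1) for FTT=1 and RAID-6 (4+2) for FTT=2."""
    if method == "RAID-1":
        return ftt + 1
    if method == "RAID-5/6":
        return {1: 4 / 3, 2: 6 / 4}[ftt]  # only FTT=1 or 2 supported
    raise ValueError(method)

# 100 GB of usable data:
print(100 * capacity_multiplier("RAID-1", 1))    # 200 GB raw
print(100 * capacity_multiplier("RAID-5/6", 1))  # ~133 GB raw
```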
Force Provisioning: allows violation of FTT, stripe width and flash read cache reservation during the initial deployment of a VM
Points to Consider
- Adding Resource to VSAN: Once additional resources become available in the cluster, vSAN may immediately consume these resources to try to satisfy the policy settings of virtual machines
- Data migration: if an object is non-compliant, a "Full Data Evacuation" of that object behaves like "Ensure Accessibility"
- Object Space Reservation: Default 0% , Maximum 100%
IOPS Limit per Object
- Prevents noisy neighbors
- Creates artificial standards of service as part of a tiered service offering using the same pool of resources
Software Checksum
- Enabled by default. Carries a small overhead of disk I/O, CPU and memory
- Can be disabled per policy using DisableObjectChecksum
VM Name Space (VM Home Object) and Swap Considerations: They don’t inherit all settings from storage policy
VM Name Space
- Number of Disk Stripes Per Object: 1
- Flash Read Cache Reservation: 0%
- Number of Failures To Tolerate: (inherited from policy)
- Force Provisioning: (inherited from policy)
- Object Space Reservation: 0% (thin)
VM Swap Object: not visible in the UI; use RVC commands
- Number of Disk Stripes Per Object: 1 (i.e. no striping)
- Flash Read Cache Reservation: 0%
- Number of Failures To Tolerate: 1
- Force Provisioning: Enabled ( To disable use the setting SwapThickProvisionDisabled)
- Object Space Reservation: 100% (thick).
Snapshot Delta Disks
- Snapshot disks inherit the policy settings of VMDK
- Not visible in UI
- VSAN 5.5: Max 256 GB; Memory Snapshots are stored in VM Namespace and Maximum Namespace size is 256 GB
- VSAN 6.0 : No limits; Memory Snapshots are instantiated as objects
Changing a VM Storage Policy Dynamically
Changing policies dynamically may lead to a temporary increase in the amount of space consumed on the vSAN Datastore
- FTT Increased: New Replicas are created in addition to the existing Replicas.
- Stripe Width Increased: Existing Replicas cannot be used. Creates brand new replicas
Provisioning with a policy that cannot be implemented
- vSAN does not consolidate current configurations to accommodate newly deployed virtual machines
- Example: vSAN will not move components around hosts or disks groups to allow for the provisioning of a new replica, even though this might free enough space to allow the new virtual machine to be provisioned.
Provisioning with default policy: VMDK Thick provision / Thin Provision
- VSAN 5.5 : If no policy is selected while provisioning , Default policy uses Thick Provisioning
- VSAN 6.0: Default storage policy has all the capabilities
Host Design Considerations
- Sockets/Host, Core/Socket & vCPU/Core
- # of VMs & vCPUs/VM
- CPU Overhead for VSAN: 10%
- Desired Memory for VMs
- A minimum of 32GB is required per ESXi host for full VSAN functionality
VSAN Host Storage
- VMDKs : Storage required ( # of VMs, VMDKs size required for each VM)
- Memory Consumed by VMs ( .vswp )
- # of Snapshots per VM
- how long they are maintained
- Estimated space consumption for each snapshot
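The items above can be folded into a back-of-the-envelope host sizing sketch; all input values and the helper name are hypothetical, and real sizing should also account for formatting and metadata overheads:

```python
def host_raw_storage_tb(n_vms, vmdk_gb, vm_mem_gb, snap_per_vm,
                        snap_gb, ftt=1, slack=0.30, n_hosts=4):
    """Per-host raw capacity estimate: VMDKs + swap (.vswp, sized to VM
    memory) + snapshot space, mirrored ftt+1 times, plus 30% slack,
    spread evenly across hosts."""
    per_vm = vmdk_gb + vm_mem_gb + snap_per_vm * snap_gb
    consumed = n_vms * per_vm * (ftt + 1)
    total = consumed / (1 - slack)   # keep 30% free (slack space)
    return total / n_hosts / 1024    # TB per host

# 100 VMs, 100 GB VMDK, 8 GB RAM, 2 snapshots of ~10 GB each, FTT=1
print(round(host_raw_storage_tb(100, 100, 8, 2, 10, ftt=1), 2))  # ~8.93 TB/host
```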
Boot Device Considerations
- VSAN 5.5: USB & SD
- VSAN 6.0: USB, SD and SATADOM
USB & SD: logs and traces reside in RAM
- Redirect logs to persistent storage (not the vsanDatastore)
- VMware does not recommend storing logs and traces on the vSAN Datastore
SATADOM : Traces reside in the SATADOM device
- Use SLC class device for performance and endurance.
- Compute Only Hosts: Not Recommended; Use balanced configurations
Maintenance Mode Considerations
- FTT: Enough hosts required to meet FTT?
- Stripe Width: # of capacity devices available on rest of the hosts to meet stripe width?
- Capacity for Data Migration: Enough capacity available on rest of the hosts?
- Flash capacity: Enough flash available to meet flash read cache reservations (Only in Hybrid)
Blade system considerations: typically not enough disk slots to scale local storage capacity
- Consider External Storage Enclosures : Ensure VCG
- Processor Power Management Considerations: Avoid Extreme power-saving modes. Select ‘balanced’ mode.
Cluster Design Considerations
2/3 Node Configurations can tolerate only one failure
- Recommended 4+ node clusters
VSAN works with vSphere HA
- Host failure: HA restarts VMs
- Network partition: HA understands vSAN objects and restarts a VM on a partition that still has access to a quorum of the VM's components
- HA must use VSAN network for communication
- HA does not use the vsanDatastore for datastore heartbeating
- HA must be disabled before enabling vSAN; HA may be re-enabled only after vSAN is configured
Additional capacity required to rebuild components
- VSAN does not interoperate with HA to ensure there is enough disk space available on remaining hosts in the cluster
Fault Domains: Rack Availability
- No two copies/replicas of the Virtual Machines data will be placed in the same fault domain
- Consider additional resource requirements to rebuild the components.
Requirement: Uniformly configured hosts (Balanced Configurations)
- Having unbalanced domains might mean that vSAN consumes the majority of space in one domain that has low capacity, and leaves stranded capacity in the domain that has larger capacity
Deduplication and Compression considerations
- Single feature ( For both Deduplication and Compression)
- When this feature is enabled, objects will not be deterministically assigned to a capacity device in a disk group, but will stripe across all disks in the disk group.
Determining if Workload is suitable for VSAN
Cache friendly applications
- If an application is not cache friendly, its performance depends on the capacity devices
- VMware Horizon View: use View Planner for vSAN sizing
- SDDC/VMware infrastructure: use VMware Infrastructure Planner