Troubleshooting Common Challenges in Ansible – Part 4

In the previous blogs, Mastering Ansible for Server Management – Part 2 and Server Management using RedHat Ansible Automation Platform – Part 3, we discussed Ansible and Ansible Automation Platform, exploring their fundamental concepts and learning how to set up a robust Ansible project. In this blog, we will discuss some of the gotchas and common issues faced with Ansible.

To facilitate this discussion, we will look at how lead engineer Matt and his team implemented Ansible and the key lessons they learned along the way. 

After 3 months of Ansible implementation: 

Matt: Hey Sam, I would like to introduce you to Mark, a senior DevOps engineer from the Avengers team. They are also planning to implement Ansible and would like insights into our team’s experiences and the common issues we encountered. 

Sam: Sure, Matt. I have a list of issues we faced and a few gotchas we learned along the way. I will share them with Mark. Hey Mark, here is the list of issues and learnings: 

Learnings

Dynamic Inventories

Sam: We started with static inventory files but very soon realised how difficult they were to manage. They work well for small, stable environments, but with servers coming and going, inconsistencies can quickly creep in.

Mark: True, Sam. How did you tackle that? 

Sam: We had servers in both AWS and on-premises, and the dynamic inventory script fetched server details from the AWS test and prod environments. This meant we did not have to manually update the inventory file whenever a new server was added or decommissioned. 

Mark: Can you share an example of the script you used for updating the dynamic inventory? 

Sam: In our case, we utilised a Python script that interacted with the AWS CLI to fetch information about EC2 instances. This script, when executed, dynamically generated the inventory with the latest server details. Let me provide you with a simplified snippet: 

#!/usr/bin/env python

import subprocess
import json
import yaml

# Use the AWS CLI to get EC2 instance details
aws_command = (
    "aws ec2 describe-instances "
    "--query 'Reservations[*].Instances[*].[InstanceId,PrivateIpAddress,Tags[?Key==`Name`].Value]' "
    "--output json"
)
result = subprocess.run(aws_command, shell=True, stdout=subprocess.PIPE)
reservations = json.loads(result.stdout.decode())

# Define the file path for the dynamic inventory
inventory_file_path = "./dynamic_inventory.yaml"

# Prepare the Ansible inventory
inventory = {
    "_meta": {
        "hostvars": {}
    },
    "aws": {
        "hosts": [],
        "vars": {
            "ansible_connection": "ssh",
            "ansible_user": "ec2-user",
            # Additional group variables if needed
        },
    },
}

# Populate the inventory with EC2 instance details; the query returns a list
# of reservations, each of which contains a list of instances
for reservation in reservations:
    for instance_id, private_ip, tags in reservation:
        if private_ip is None:
            # Skip instances without a private IP (e.g. terminated instances)
            continue
        inventory["_meta"]["hostvars"][private_ip] = {
            "ansible_host": private_ip,
            "ansible_ssh_private_key_file": "/path/to/your/private/key.pem",
            # Additional host variables if needed
        }
        inventory["aws"]["hosts"].append(private_ip)

# Write the dynamic inventory to a YAML file
with open(inventory_file_path, "w") as inventory_file:
    yaml.dump(inventory, inventory_file, default_flow_style=False)

print(f"Dynamic inventory has been written to {inventory_file_path}")
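One detail worth noting: when you point Ansible directly at an executable inventory script (for example, ansible-playbook -i ./inventory.py), Ansible invokes the script with --list and expects JSON on stdout, and with --host <name> for per-host variables (which may return an empty dict when everything lives under _meta). Here is a minimal sketch of that contract, with hardcoded example hosts standing in for the real AWS lookup:

```python
#!/usr/bin/env python
import json
import sys


def build_inventory():
    # Hardcoded example data standing in for a real AWS/VMware lookup
    hosts = ["10.0.1.10", "10.0.1.11"]
    return {
        "_meta": {"hostvars": {h: {"ansible_host": h} for h in hosts}},
        "aws": {"hosts": hosts, "vars": {"ansible_user": "ec2-user"}},
    }


if __name__ == "__main__":
    if "--list" in sys.argv:
        print(json.dumps(build_inventory()))
    elif "--host" in sys.argv:
        # Per-host variables already live under _meta, so return an empty dict
        print(json.dumps({}))
```

Because per-host variables are returned under _meta in the --list output, Ansible will not call the script once per host, which keeps large inventories fast.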

 

 

Mark: Thanks, Sam. How did you dynamically manage the on-prem inventory? 

Sam: Currently, we do not have dynamic management for on-premises servers; it is a manual process. However, since we use VMware, which offers an API, we are exploring plans to leverage it for automation and dynamic inventory management in the future. 

One more important thing to remember: 

Regularly test and update your dynamic inventory scripts. Think of them like well-tended gardens – if you let them go wild, you might find surprises.

Ansible Vault

Sam: Managing sensitive data was another headache initially.  

Ansible Vault is our guardian here. Encrypt sensitive data and keep it out of plain sight. Or you can also consider external systems for secrets, like HashiCorp Vault or AWS Secrets Manager. 

In our case, we created encrypted files using Ansible Vault for storing sensitive data like passwords and API keys.  

When running a playbook, Ansible would prompt for the Vault password, ensuring that only authorised personnel could access the encrypted data. 

For testing purposes, you can follow the approach below and later remove the vault password file.

 

 

# Create a vault password file and enter a password
openssl rand -base64 16 > vault-pwd

# Use this secret to encrypt a sensitive value, such as a user's password
ansible-vault encrypt_string --vault-password-file vault-pwd 'sensitive_value' --name 'admin_password'

Use the output of the above command as the encrypted value in the variable file.
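For illustration, here is roughly what the resulting variable file looks like (the file path is hypothetical and the ciphertext below is a truncated placeholder, not real output), together with a task that consumes the decrypted value at runtime:

```yaml
# vars/secrets.yml (hypothetical path) – paste the encrypt_string output here
admin_password: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          3562343164353036623864...
---
# A task that references the vaulted variable like any other variable
- name: Use the vaulted credential (illustrative)
  ansible.builtin.debug:
    msg: "admin_password is available as {{ admin_password }}"
  no_log: true
```

Ansible decrypts !vault values transparently at runtime, so playbooks reference the variable normally; no_log keeps the decrypted value out of the task output.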

 

To use these credentials, you must pass the vault password in the command.

 

# To use these credentials, pass the vault password to the ansible-playbook command
ansible-playbook config/configure_ad_join.yml -i inventory_production -e target_host=server1 --vault-password-file=vault-pwd -v

In Ansible Automation Platform (AAP), managing credentials is simplified using the ‘Credentials’ feature, allowing seamless integration into playbook templates. 

Use these credentials along with server credentials in your playbook.

Mark: Yes, it is much easier to manage secure credentials in AAP. Also, did you encounter any challenges with this approach? 

Sam: Initially, yes. Managing and rotating Vault passwords among team members was a bit tricky. We eventually set up a process to periodically update the Vault passwords, ensuring that only the necessary people had access. Also, it is crucial to document the process well, so new team members can quickly get up to speed. 

Mark: That makes sense. We will need to plan for proper password management from the beginning. 

Sam: Absolutely, Mark. It saves you headaches down the road. 

Network Delays

Sam: Sometimes tasks used to fail due to network delays. 

For tasks sensitive to network delays, we experimented with adjusting the timeout and retries parameters in Ansible. For example, we had a playbook where fetching data from an external API sometimes took longer than usual. We set a higher timeout value and increased the number of retries, which significantly improved the playbook’s reliability. 

Below are examples of some of the different scenarios we handled:

  • Adjusting Timeout Values:  
    Ansible’s connection timeout defaults to 10 seconds (the timeout setting in ansible.cfg). Since Ansible 2.10, individual tasks also accept a timeout keyword that caps how long a task may run; by default there is no per-task limit. If your tasks involve operations that might take longer, consider raising these values.

 

- name: Task with a longer timeout
  command: some_command
  timeout: 60   # Set the timeout to 60 seconds

  • Retrying Failed Tasks:
    Ansible allows you to set the number of retries for a task using the retries and delay parameters. This is useful when intermittent network issues may cause a task to fail.

 

- name: Task with retries
  command: some_command
  register: result
  retries: 3    # Retry the task up to 3 times
  delay: 10     # Wait for 10 seconds between retries
  until: result is succeeded

  • Asynchronous Tasks:
    For long-running tasks, consider using Ansible’s asynchronous execution. This allows Ansible to launch a task and continue playbook execution without waiting for the task to complete. You can later poll for the task status. 

 

- name: Asynchronous task
  command: some_long_running_command
  async: 300     # Set the maximum runtime for the task (in seconds)
  poll: 0        # Do not wait for the task to complete
  register: task_result

- name: Wait for the asynchronous task to complete
  async_status:
    jid: "{{ task_result.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30    # Retry up to 30 times with a delay between retries
  delay: 10      # Wait for 10 seconds between retries

Matt: It is essential to fine-tune these parameters based on your infrastructure and the nature of your tasks. What worked for us might need adjustments for your environment. 

Module Compatibility

Sam: Keep in mind that Ansible modules may not seamlessly align with every operating system. For instance, in our case, we managed Linux servers running CentOS and Ubuntu. Given that Ubuntu relies on apt and CentOS on yum as the package manager, we opted for the more versatile package module.  

Example: Let us consider a scenario where you want to install the firewalld package.

Avoid OS-Specific Module Calls: 

- name: Install firewalld on CentOS
  yum:
    name: firewalld
    state: present
  when: ansible_os_family == 'RedHat'

- name: Install firewalld on Ubuntu
  apt:
    name: firewalld
    state: present
  when: ansible_os_family == 'Debian'

 

Use the Generic Package Module: 

- name: Install firewalld using the package module
  package:
    name: firewalld
    state: present

In this example, Ansible detects the operating system and installs the package accordingly, eliminating the need to specify apt or yum explicitly. It simplifies your playbook, making it cleaner and more maintainable.

This approach is not limited to package management; Ansible provides similar abstraction for various tasks, like service management, file manipulation, and more. 
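For instance, the same idea applies to services: the generic service module delegates to systemd, sysvinit, or whatever init system the target uses. A short sketch, assuming firewalld is already installed on the target:

```yaml
- name: Ensure firewalld is running and enabled at boot
  service:
    name: firewalld
    state: started
    enabled: yes
```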

Troubleshooting

Mark: What are some of the common troubleshooting scenarios you can share? 

Sam: Certainly, Mark. Here are some common troubleshooting scenarios we faced: 

  • SSH Connection Issues: 
    Symptom: Playbook fails with SSH-related errors. 
    Solution: Verify SSH connectivity between the control node and managed nodes. Check SSH configuration, keys, and user permissions. Use the -vvv option with ansible-playbook to get detailed SSH debugging information. 
  • Inventory File Errors: 
    Symptom: Playbook fails due to issues with the inventory file. 
    Solution: Check the syntax of the inventory file. Ensure that hostnames, IP addresses, and groups are defined correctly. Run ansible-inventory --list to validate the parsed inventory data.
    Case sensitivity in Inventory file: 
    The names of hosts and groups in the inventory file are case-sensitive. Ensure consistency in naming to avoid issues with host identification.
  • Playbook Syntax Errors: 
    Symptom: Playbook execution fails with syntax errors. 
    Solution: Run ansible-playbook playbook.yml --syntax-check to validate the playbook syntax. Check for indentation errors, missing colons, or other syntax issues.
  • Variable Scope Issues: 
    Symptom: Variables are not being recognised or have unexpected values.
    Solution: Review variable scopes. Ensure that variables are defined in the correct context (e.g., playbook, role, or task level). Use the debug module to print variable values during playbook execution.  

- name: Debug Information
  debug:
    var: some_variable

  • Permissions and Privilege Escalation: 
    Symptom: Playbook fails due to insufficient permissions or privilege escalation issues. 
    Solution: Confirm that the user running Ansible has the necessary permissions on both the control node and managed nodes. Check sudo configurations if privilege escalation is required. 
  • Package Manager Variations: 
    Symptom: Playbook fails due to package manager differences on different operating systems. 
    Solution: Use the generic package module instead of OS-specific package manager modules. Ansible will intelligently choose the correct package manager based on the detected operating system. 
  • Ansible Version Mismatch: 
    Symptom: Playbook fails or behaves unexpectedly due to version mismatches.
    Solution: Ensure that you are using compatible Ansible versions on the control node and managed nodes. Update Ansible to the latest version to benefit from bug fixes and new features. 
  • Firewall Issues:
    Symptom: Playbook fails due to firewall restrictions.
    Solution: Check if firewalls on the control node, managed nodes, or network devices are blocking Ansible communication. Ensure that the necessary ports (usually SSH port 22) are open. 
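As a small helper for the playbook syntax scenario above: before reaching for ansible-playbook --syntax-check, a plain YAML parse already catches indentation and missing-colon mistakes. This is our own little pre-check (a sketch using PyYAML, which the inventory script above already depends on), not an Ansible tool:

```python
import tempfile

import yaml


def yaml_parses(path):
    """Return (True, None) if the file parses as YAML, else (False, error text)."""
    try:
        with open(path) as f:
            list(yaml.safe_load_all(f))
        return True, None
    except yaml.YAMLError as exc:
        return False, str(exc)


# A deliberately broken playbook snippet: no space after the colon on 'hosts'
with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as f:
    f.write("- name: demo\n  hosts:all\n")
    broken = f.name

ok, err = yaml_parses(broken)
print(ok)   # False – the parser flags the malformed mapping line
```

Note that this only validates YAML structure; it knows nothing about Ansible keywords, so --syntax-check remains the authoritative check.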

Ansible Control Node

Sam: The Ansible control node, the system where Ansible is installed, must run on Linux or macOS; it cannot be a Windows operating system. You can find more details on control node requirements in Ansible’s official documentation.

Despite this restriction, Ansible remains highly versatile, as it can effectively manage Windows servers. The reason for this difference in operating system requirements is due to the underlying design and dependencies of Ansible, which align more seamlessly with Unix-like environments.  

However, to facilitate cross-platform automation, Ansible incorporates specialised modules and functionalities that enable efficient management of Windows servers, allowing organisations to employ Ansible’s automation capabilities across a diverse range of operating systems. 
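As a quick illustration of those Windows-specific modules (a sketch, assuming WinRM connectivity is already configured on the targets and a windows group exists in your inventory):

```yaml
- name: Verify connectivity to Windows hosts
  hosts: windows
  vars:
    ansible_connection: winrm
    ansible_winrm_transport: ntlm   # or kerberos/credssp, per your environment
  tasks:
    - name: Ping over WinRM
      ansible.windows.win_ping:
```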

Mark: Solid advice. Thanks for sharing your experiences, Sam. We will keep these lessons in mind as we dive into Ansible. 

Matt: You’re welcome, Mark. If you have any more questions as you get started, feel free to reach out. Good luck with your Ansible journey! 

Mark: Thanks, Matt and Sam. This gives us a great starting point for our Ansible implementation.  

Conclusion

In conclusion, as you embark on your Ansible journey, remember to stay vigilant with regular testing, updates, and collaboration. Leverage the robust Ansible community, where experiences are shared, and solutions abound.  

As we have explored, the journey involves considerations ranging from optimising dynamic inventories and securing sensitive data with Ansible Vault to addressing network delays and fine-tuning playbook parameters. Yet, despite these potential obstacles, Ansible stands as a powerful ally in the realm of infrastructure automation. Its ability to transcend operating system boundaries, seamlessly managing both Unix-like and Windows environments, underscores its versatility. 

While the implementation of Ansible promises streamlined automation, it is essential to be mindful of common gotchas. With these considerations in mind, Ansible becomes not just a tool but a reliable companion in the pursuit of efficient and scalable infrastructure management.  

Happy automating! 
