Samba DNS issues resolved, finally!

I will preface this entire blog post with a disclaimer: until two months ago I had never worked with Samba, so take this information/advice with the appropriate grain(s) of salt.

Samba Issue

The domain controller functionality seemed to be working correctly.  All our servers could authenticate users and resolve DNS entries.  But when we tried to view the Forward Lookup Zone entries through the Windows DNS Manager, or used the samba-tool command to query all DNS entries for a zone, the Samba service would hit an exception and shut down.  This meant we couldn’t view or update DNS entries via the Windows DNS Manager.  Replication between our domain controllers was also unreliable: sometimes it would work and sometimes not.

Here is an example of the samba-tool command that would cause the failure:

$ samba-tool dns query dc1.mydomain.com mydomain.com @ ALL
ERROR(runtime): uncaught exception - (-1073741300, 'The transport connection is now disconnected.')
  File "/usr/lib/python2.7/dist-packages/samba/netcmd/__init__.py", line 175, in _run
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/samba/netcmd/dns.py", line 994, in run
    None, record_type, select_flags, None, None)

And this is the related error I found in syslog:

dnsserver: Invalid zone operation IsSigned

We saw the same errors when attempting to use the Windows DNS Manager.

Finding Help

We had little experience with Samba and hoped to find an expert who could help us solve the problem, so I first tried posting the issue on the Ubuntu Stack Exchange site:

https://askubuntu.com/questions/1022305/find-remove-bad-dns-entry-in-samba

I did get several comment responses but they weren’t helpful. 

At the time we weren’t exactly sure what the root cause was, but the start of these errors did seem to coincide with adding a new DNS entry via the DNS Manager.  So we suspected there might be a corrupt DNS entry.  Unfortunately, we didn’t document exactly what was added and we didn’t know how to find the most recently added entry.

Next we posted a job on Guru.com hoping to find a knowledgeable person who could help us. Two “gurus” took the job but neither was able to find the root cause or fix it.  Their best suggestion was to start over with a new Samba instance.  We agreed it was a hard problem but weren’t ready to go with the nuclear option, especially considering we had low confidence we could reproduce the setup in a good/working state.

Finally we looked to get help on the Samba mailing list:

https://www.samba.org/samba/archives.html

This ended up being the most helpful option, with many knowledgeable Samba experts, but in the end we weren’t able to resolve the issue with their help, although we did try many fixes.

https://lists.samba.org/archive/samba/2018-April/215161.html

Samba info

Through our research we found the following command helpful in understanding the state and setup of our domain controllers:

samba-tool fsmo show

This showed us which domain controller held the FSMO roles, that is, which one was the master.

Logs and Configuration details

The following documentation was helpful in finding more details as well:

https://wiki.ubuntu.com/DebuggingSamba#samba-server

  • the content of the /etc/samba/smb.conf file
  • log files found in /var/log/samba/
  • the output of the smbclient -L //server/
  • the output of testparm -s

Troubleshooting

Upgrades

We tried upgrading to the latest version of Samba but this didn’t help either. So we reverted to the original version to avoid further unnecessary changes.

Replication

Through some divine intervention we randomly got all 3 of our domain controllers fully functional again, temporarily.  Not sure exactly how but we think forcing replication from a non-master DC down to the master removed the corrupt DNS entry that we had added.  The master DC started failing again a few hours later but thankfully we took snapshots of the VMs while they were all still working.

With snapshots of the DCs in a good working state, we now had a much easier way to test Samba. 

ldbsearch

Through some help on the Samba mailing list and more googling with Bing, we found that Samba persists all of its data in a file database on the local machine.  Ours was located at /var/lib/samba/private/sam.ldb.  Further, you can query the database using the ldbsearch command.  Using this we were able to query all DNS records in the database without error and compare the results across domain controllers.  Here is the ldbsearch command we used:

$ sudo ldbsearch -H /var/lib/samba/private/sam.ldb -b "DC=DomainDnsZones,DC=acme,DC=com" "(objectclass=dnsNode)" --show-binary

The command outputs one block like the following per record (only record 7 is shown here):

# record 7
dn: DC=server12.acme.com,DC=acme.com,CN=MicrosoftDNS,DC=DomainDnsZones,DC=acme,DC=com
objectClass: top
objectClass: dnsNode
instanceType: 4
whenCreated: 20170304040102.0Z
whenChanged: 20170304040102.0Z
uSNCreated: 45038
uSNChanged: 45038
showInAdvancedViewOnly: TRUE
name: server12.acme.com
objectGUID: 607689c8-3f08-48cc-82d6-2ecaa97d8481
dnsRecord:     NDR: struct dnsp_DnssrvRpcRecord
         wDataLength              : 0x0004 (4)
         wType                    : DNS_TYPE_A (1)
         version                  : 0x05 (5)
         rank                     : DNS_RANK_ZONE (240)
         flags                    : 0x0000 (0)
         dwSerial                 : 0x000001a4 (420)
         dwTtlSeconds             : 0x00000000 (0)
         dwReserved               : 0x00000000 (0)
         dwTimeStamp              : 0x0037aa4c (3648076)
         data                     : union dnsRecordData(case 1)
         ipv4                     : 11.45.2.212

objectCategory: CN=Dns-Node,CN=Schema,CN=Configuration,DC=acme,DC=com
dc: server12.acme.com
distinguishedName: DC=server12.acme.com,DC=acme.com,CN=MicrosoftDNS,DC=DomainDnsZones,DC=acme,DC=com

# returned 278 records
# 278 entries
# 0 referrals
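
One field in this output is worth a closer look if you are hunting for recently registered entries. As far as I understand the Microsoft DNS record format, dwTimeStamp counts whole hours since 1601-01-01 00:00 UTC, and a value of 0 usually marks a static record. The following is a minimal Python sketch of that conversion under those assumptions; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dwTimeStamp is "hours since 1601-01-01 00:00 UTC",
# and 0 means a static record that carries no timestamp.
DNS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def dns_timestamp_to_datetime(hours):
    """Convert a dnsRecord dwTimeStamp value to a UTC datetime (None if 0)."""
    if hours == 0:
        return None
    return DNS_EPOCH + timedelta(hours=hours)


# The dwTimeStamp value from record 7 above.
print(dns_timestamp_to_datetime(0x0037AA4C))
```

For record 7 this gives 2017-03-04 04:00 UTC, which lines up with the record's whenCreated value of 20170304040102.0Z.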

So we wrote this output to a file, used scp to copy the file to our Windows machine, and then used the following C# code to parse, filter and sort the records:

using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Read the ldbsearch output that was copied over from the DC.
var input = File.ReadAllText(@"C:\temp\dc1_dnsrecords_workingAfterChanges.2018.05.31.1018.txt");

// Each record starts with a "# record N" comment line, so split on '#'.
var rows = Regex.Matches(input, @"#[^#]*");

var records = rows.Cast<Match>().Select(x => new
{
     DnsType = Regex.Match(x.Value, @"(?<=DNS_TYPE_)\S+").Value,
     Name = Regex.Match(x.Value, @"(?<=dc:\s)\S+").Value,
     DistinguishedName = Regex.Match(x.Value, @"(?<=dn:\s)\S+").Value,
     Ip = Regex.Match(x.Value, @"(?<=ipv4\s+:\s+)[\d.]*").Value,
     // RecordNumber = Regex.Match(x.Value, @"(?<=#\srecord\s)\d+").Value,
     // Raw = x.Value,
})
// Optional filters:
//.Where(x => x.DistinguishedName.Contains("server12"))
//.Where(x => x.Name.Contains("Smith"))
// Sort by name, then by record type within each name.
.OrderBy(x => x.Name).ThenBy(x => x.DnsType)
.ToList();

Using this setup we could fairly easily compare DNS entries between different DCs and also single DCs in both a working and broken state. 
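
For readers without a Windows machine handy, here is a rough Python take on the same comparison. It is a sketch under my own assumptions rather than the code described above: the function names and the chosen fields (dn, DNS types, IPv4 addresses, whenChanged) are illustrative, and it assumes the dump uses the "# record N" layout shown earlier with no folded lines.

```python
import re

# Each record in the dump starts with a line like "# record 7".
RECORD_START = re.compile(r"^# record \d+[ \t]*$", re.MULTILINE)


def parse_dump(text):
    """Split ldbsearch output into a dict of records keyed by dn."""
    records = {}
    # Anything before the first "# record N" line is discarded.
    for chunk in RECORD_START.split(text)[1:]:
        dn = re.search(r"^dn:\s*(.+)$", chunk, re.MULTILINE)
        if not dn:
            continue
        changed = re.search(r"^whenChanged:\s*(\S+)", chunk, re.MULTILINE)
        records[dn.group(1).strip()] = {
            "types": sorted(set(re.findall(r"DNS_TYPE_(\w+)", chunk))),
            "ipv4": sorted(set(re.findall(r"ipv4\s*:\s*([\d.]+)", chunk))),
            "whenChanged": changed.group(1) if changed else "",
        }
    return records


def diff_dumps(a, b):
    """Return (dns only in a, dns only in b, dns whose types or IPs differ)."""
    def key(rec):
        return (rec["types"], rec["ipv4"])

    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(dn for dn in set(a) & set(b) if key(a[dn]) != key(b[dn]))
    return only_a, only_b, changed


def most_recent(records, n=5):
    """Return the dns of the n most recently changed records."""
    # whenChanged looks like 20170304040102.0Z, so text order is time order.
    return sorted(records, key=lambda dn: records[dn]["whenChanged"], reverse=True)[:n]
```

To use it, read two dump files into strings, pass each through parse_dump, and hand the results to diff_dumps. Calling most_recent on a single dump is one way to answer the "what was added last?" question we struggled with earlier.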

The Solution(s)

After comparing the differences between DCs in both working and broken states we narrowed our troubleshooting to the 3-4 DNS entries that were different. 

One issue we found was that, in some cases, DNS entries for sub-domain FQDNs also needed corresponding CNAME entries.  Here is the related bug we found:

https://bugzilla.samba.org/show_bug.cgi?id=9409

Additionally, we noticed some extra/wrong DNS entries for some of the domain controllers themselves, so we fixed/removed those on all the DCs.  It’s unclear if this was part of the solution or not.

Lastly, we found that we could restore the snapshot of the master DC and everything would work for about an hour, and then we would get the error again.  So we compared the DNS entries in the freshly restored snapshot against those an hour later, when it broke.  The differences were new computers that had recently been added to the network.  We aren’t sure if these machines were registering with the DC themselves or if the entries were coming from replication from the other DCs.  Regardless, we initially found that deleting the entries manually on the master DC fixed the error.  Further, we could re-add the exact same DNS entry and everything would continue to work.  For example:

# samba-tool dns delete dc1 acme.com server12.acme.com A 11.60.31.70
# samba-tool dns add dc1 acme.com server12.acme.com A 11.60.31.70

After doing this for several new entries everything became stable.  We suspect that restoring snapshots left the DC in an odd state and deleting/re-adding the new entries fixed it. 
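
If you have more than a handful of entries to cycle, generating the command pairs can save some typing. The following Python sketch only builds and prints the commands so they can be reviewed before anything is run. The helper name and the entry list are hypothetical, the argument order simply mirrors the two commands above, and shlex.join needs Python 3.8 or newer.

```python
import shlex


def readd_commands(server, zone, entries):
    """Build delete + add samba-tool command pairs for (name, type, data) entries."""
    commands = []
    for name, rtype, data in entries:
        # Delete first, then add the identical record back.
        for action in ("delete", "add"):
            commands.append(
                ["samba-tool", "dns", action, server, zone, name, rtype, data]
            )
    return commands


# Hypothetical entry list; print the commands for review instead of running them.
entries = [("server12.acme.com", "A", "11.60.31.70")]
for cmd in readd_commands("dc1", "acme.com", entries):
    print(shlex.join(cmd))
```

Printing rather than executing is deliberate: on a DC that is already misbehaving, it seems safer to eyeball each command before pasting it into a root shell.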

So in the end we believe three solutions compounded to produce the overall fix.

Lessons Learned

It’s hard to really call this a success because I still don’t feel like we have a strong understanding of how to manage Samba well, but at the very least we are far more knowledgeable and have a far better troubleshooting process.  Through this process we learned:

  • In our experience, Samba is not as stable as a Windows Server DC
  • The Samba mailing list has the most knowledgeable experts; get their help!
  • Keep recent snapshots of your DCs in good working order; these will be invaluable in troubleshooting
  • Periodically verify that replication is working correctly

This has been one of the toughest technology problems I’ve ever faced and I wouldn’t wish it on anyone, ever.  My hope is this post helps others fix Samba faster.  Please leave a comment below if you find this post helpful or you have further questions.  I’m not sure I will be able to help, but I will always try.

Cheers!
